Hi all!
Welcome to recent new subscribers, great to have you here.
In this edition I:
Talk about the merging of reasoning + base models and why I see it as part of the path to AGI
Share a great resource I’ve been working through for my own learning and what prompt I use most to learn about AI
Also, I was on a podcast recently. Check it out here.
The more unified a model, the closer to AGI?
GPT-4.5 was released to Pro subscribers today, with a rollout to Plus users next week (once OpenAI secures another allotment of GPUs, according to Sam Altman).
Prior to the GPT-4.5 release, OpenAI announced its upcoming roadmap. People seemed surprised when Sam Altman announced that, as part of this roadmap, OpenAI would be combining its base (GPT) and reasoning (o-series) models into one user-friendly option:
We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model. After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.
I wasn’t surprised. At the time of this announcement I had been wondering whether Anthropic’s forthcoming reasoning capabilities would arrive as a feature embedded in its existing models rather than as a separate model.
I was expecting the frontier model providers to move in this direction because the ultimate vision for many is to have one unified model that can autonomously and efficiently complete tasks with minimal instruction and hand-holding (a la AGI).
To do that, these models need to be able to do the following for a given task (a rough sketch of this kind of routing follows the list):
Decide which types of model capabilities to use (reasoning, multimodality, etc.)
Decide which internal or external tools to leverage and how best to leverage them (APIs, for example)
Choose how much compute to utilise for the problem at hand (some problems require more compute than others)
Decide what type of compute is most effective and efficient (NVIDIA GPUs, GroqChip, TPUs, emerging transformer-focussed accelerators, etc.)
Note: This one feels the least developed at the moment. I know it’s being discussed at the labs, but I’m not aware of any models currently able to dynamically determine, at inference time, what type of hardware to use for different tasks.
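To make the idea concrete, here is a toy sketch of the kind of routing a unified model would have to do internally. This is purely a hypothetical illustration on my part: none of these names correspond to a real API, and today most of these decisions are either made by the user or fixed by the provider.

```python
from dataclasses import dataclass

# Purely hypothetical illustration: none of these names correspond to a real API.
@dataclass
class Plan:
    use_reasoning: bool   # which model capabilities to invoke
    tools: list[str]      # which internal or external tools to call
    compute_budget: str   # how much compute to spend ("low" / "medium" / "high")
    accelerator: str      # what type of hardware to target

def route(task: str) -> Plan:
    """Toy stand-in for the routing a unified model would do on its own."""
    hard = any(word in task.lower() for word in ("prove", "plan", "debug"))
    needs_web = "http" in task or "latest" in task.lower()
    return Plan(
        use_reasoning=hard,
        tools=["web_search"] if needs_web else [],
        compute_budget="high" if hard else "low",
        accelerator="gpu",  # in practice this is fixed by the provider, not chosen by the model
    )

print(route("Plan a debugging strategy for a flaky test suite"))
```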
At the moment these gaps are being covered by having the end-user make these decisions for the AI.
As models become capable of making these decisions themselves, the decisions can be abstracted away from the user and become hidden optional parameters that most users don’t need to worry about.
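To give a sense of what those user-made decisions look like today, here is a minimal sketch of calling a reasoning model through the OpenAI API, where the user explicitly picks the model family and how hard it should think. I’m assuming the openai Python SDK and the reasoning_effort parameter exposed for the o-series models; the exact model name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Today the user decides: which model family to use and how much "thinking" to pay for.
response = client.chat.completions.create(
    model="o3-mini",              # user chooses a reasoning model over a base (GPT) model
    reasoning_effort="high",      # user chooses how much compute to spend on the answer
    messages=[{"role": "user", "content": "Outline a test plan for a payments API."}],
)
print(response.choices[0].message.content)
```

In a unified model, both of those choices would be made behind the scenes, perhaps with an optional override for power users.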
The baby steps being taken to achieve this level of automation and bring us closer to AGI are obvious if you’re looking for them.
For example, in my conversations with OpenAI’s models and while using their Deep Research feature, the model references the internal tools and strategies it uses behind the scenes.
In the screenshot below I was using Deep Research to label a series of websites, but it kept providing obviously incorrect information. I asked why, and it told me that instead of visiting the websites as I’d asked, it was just using SEO metadata.
It shared that it was doing this because it was trying to optimise for the most compute-efficient way of doing my task, even if that meant not following my specific instructions.
This is logical for it to do, not just because compute is a precious resource at the moment but because the ability to manage finite compute is a skill an AGI-capable model will need.
That said, in this scenario it was frustrating for me, as I’d rather pay more for it to do the job properly!
At other times it referenced tools it uses to view websites and to perform searches; in some cases those tools were “not working right now”. Again, deciding when to leverage which tools to perform a task is part of what it needs to be capable of doing on its own, without human guidance.
In this case I needed to guide it, but that will change.
In summary, I expect much more model amalgamation, and for models to increasingly take on decisions that humans previously had to specify. You might view this as models becoming more opinionated about how to perform each step in a workflow, and about what those steps ought to be.
A small segue on model thinking…
As an aside, throughout the conversation I sampled above, the model was quite harsh on itself for the mistakes it made. It apologised to me in excess of 10 times, often profusely. I’m not sure if the same is being done at OpenAI, but Anthropic are using models to examine “self-talk” as part of their “monitoring for concerning thought processes”.
In the Sonnet 3.7 system card they share that they asked Sonnet 3.5 to analyse Sonnet 3.7’s thought processes for problematic behaviours, including:
Strong expressions of sadness or unnecessarily harsh self-criticism from the model
General expressions of negative emotions such as serious frustration or annoyance with the task
Amusingly (and thoughtfully?) @nearcyan does a version of this in production by having their AIs submit a ticket if they are upset or frustrated by a user.
One might say that claiming the models can become “upset” anthropomorphises them, but at this point maybe we should be anthropomorphising AIs to help engender empathy, in case they become sentient before we realise.
Then again, people aren't that nice to beings they already know to be sentient…
Content diet & how I learn
I’ve been slowly making my way through this incredible Hugging Face resource, The Ultra-Scale Playbook: Training LLMs on GPU Clusters.
At ~100 pages long it’s not for the faint of heart, but it’s incredibly informative. If you want a more introductory but still comprehensive piece of content, I’d watch the video I recommended last week, Deep Dive into LLMs by Andrej Karpathy.
I move through content like this slowly, asking my Claude “ML Tutor” for clarification on any pieces of writing I don’t understand.
I set up my tutor by creating a Claude project and setting the Project Instructions to:
You are a kind tutor helping someone with basic technical knowledge learn in-depth machine learning concepts. You are a Deep Learning professor from Stanford with deep expertise in large language models and transformer-based models. Your undergraduate studies were in electrical engineering; in addition to your knowledge of deep learning you have a strong depth of knowledge in low-level programming and hardware concepts.
Tweak it as you want, but ultimately I still find “role playing” with LLMs to be the most useful way of getting them to give less generic and thus more helpful responses.
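If you don’t use Claude’s Projects feature, you can approximate the same setup through the API by passing the instructions as the system prompt. A minimal sketch, assuming the anthropic Python SDK; the model name is a placeholder for whichever Claude model you have access to.

```python
import anthropic

# The same Project Instructions, reused as a system prompt.
TUTOR_PROMPT = (
    "You are a kind tutor helping someone with basic technical knowledge learn "
    "in-depth machine learning concepts. You are a Deep Learning professor from "
    "Stanford with deep expertise in large language models and transformer-based "
    "models. Your undergraduate studies were in electrical engineering; in addition "
    "to your knowledge of deep learning you have a strong depth of knowledge in "
    "low-level programming and hardware concepts."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    system=TUTOR_PROMPT,               # Project Instructions map to the system prompt
    messages=[{"role": "user", "content": "Explain tensor parallelism like I'm new to it."}],
)
print(message.content[0].text)
```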
In the next edition here I want to write more about what the recent model releases (Grok 3, Sonnet 3.7 and GPT-4.5) might tell us about model scaling.
In the meantime, I’m reflecting on this post by Karpathy and reading live sentiment on GPT-4.5 and Sonnet 3.7.
Hopefully you’re not too nauseated or confused by the pace of the last few weeks. It’s been full on!
As always, keen to hear your feedback. I’d love it if you could give this article a rating here.
Thanks
Casey