Ruminations #10: "Moore's Law was kickin' and then it wasn't"
My trip to SF, NeurIPS and investing in AI
Happy holidays!
I just got back from 1.5 weeks in the U.S. where I went to Modular AI’s developer day, ModCon, and NeurIPS in New Orleans.
I met with almost 60 ML developers, researchers, founders and investors. I was going to share all of my notes in this post but realised it was way too much, so I’ll share it in segments. Today’s segments:
On LLMs becoming native agents
On scaling larger and larger models
Also up on the Square Peg blog today is a summary of the presentation my colleague Yonatan and I gave on AI at our recent investor summit.
We talk about how we’re framing different AI startup opportunities, some of the things we’re thinking about when evaluating investments in the space, and what we’re excited about next.
Lastly, if you want a holiday project that involves playing around with Custom GPTs / OpenAI’s Assistants API, I wrote a Python script to connect a retrieval assistant to Slack so that I could ask a bot for file links without leaving Slack. If you want to try to do the same, I’m happy to share my code - it’s quite simple.
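In the meantime, here’s a rough sketch of the general shape of it (not my exact script). It assumes you’ve already created an Assistant with retrieval enabled and a Slack app running in Socket Mode - the environment variable names and the app_mention trigger are just placeholders for illustration:

```python
import os
import time

from openai import OpenAI
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

client = OpenAI()  # reads OPENAI_API_KEY from the environment
app = App(token=os.environ["SLACK_BOT_TOKEN"])
ASSISTANT_ID = os.environ["ASSISTANT_ID"]  # an Assistant created with retrieval enabled


@app.event("app_mention")
def handle_mention(event, say):
    # Create a fresh thread, add the user's question, and run the retrieval assistant
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=event["text"]
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=ASSISTANT_ID
    )
    # Poll until the run finishes (the Assistants API runs asynchronously)
    while run.status not in ("completed", "failed", "expired"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    # Post the assistant's latest reply back into the Slack channel
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    say(messages.data[0].content[0].text.value)


if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```

The flow is the whole trick: Slack event in, thread and run against the assistant, poll, post the reply.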
If you’re building over the break, let me know what you’re cooking up!
Content we’ve enjoyed
Ben shared this article on OpenAI’s engineering:
An interesting look under the hood at OpenAI’s engineering, product and design teams, and how they’ve achieved such a rapid product release cadence.
Ben has long enjoyed this newsletter - check it out!
Greylock shared their perspective on investing in Vertical AI. They focus on professional services, financial services and healthcare so if you’re in those spaces definitely give it a read.
Maria shared this piece about OpenAI suspending the account of ByteDance (TikTok’s parent company) after it used GPT to train its own AI.
Note from Casey:
Open source players have been doing this as well - it was a topic of conversation when I was last in Tel Aviv. Using another model's outputs to train a new model is called "knowledge distillation" and people believe it's one of the keys to developing smaller models that are still very powerful.
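For the curious, the classic logit-matching form of distillation (Hinton et al.) is only a few lines. Here’s a sketch in PyTorch - the temperature T and the weighting alpha are the usual knobs, and training on a model’s generated outputs (as in the ByteDance case) is a looser variant of the same idea:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: push the student towards the teacher's temperature-softened distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```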
Notes from my trip #1
On LLMs becoming native agents:
I asked many people about my skepticism of being able to build a technical moat around an agent product. Below is the thinking I shared; I’d love feedback!
Everyone is very focused on building agent products right now, but I’ve maintained a certain skepticism about them.
I think of an agent as something that can perform a task on behalf of someone or something else.
My skepticism is this: the Toolformer paper demonstrated that today’s large language models can learn to use APIs to call functions and, subsequently, perform certain tasks.
Add to this the fact that LLMs are increasingly being used as reasoning engines instead of just as sources of memorized information. The reasoning capabilities of the underlying model are being used to determine what should be retrieved to respond to a prompt.
So these models can:
Learn to use APIs to complete tasks
Reason well enough to determine what they need to collect to complete a certain set of tasks
These two points suggest to me that models will, over time, further develop native agent capabilities, which could make a large amount of today’s agent-specific engineering work redundant.
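To make that concrete, here’s roughly what native tool use already looks like with OpenAI’s chat API. The get_file_link function is a made-up example, but note that deciding whether and how to call it is handled by the model itself, not by agent-specific scaffolding:

```python
import json

from openai import OpenAI

client = OpenAI()

# A hypothetical tool the model can choose to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_file_link",
        "description": "Return a shareable link for a named internal file",
        "parameters": {
            "type": "object",
            "properties": {"filename": {"type": "string"}},
            "required": ["filename"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Send me the link to the Q3 board deck"}],
    tools=tools,
)

# The model, not bespoke agent code, decides which tool to call and with what arguments
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```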
I’m not suggesting that all of the work going into agent products today is redundant - there’s something to be said for capturing customers early, and you can do that by building now. I would just be thinking about what you’re truly differentiating on over time.
On scaling larger and larger models:
Although Mark Zuckerberg declared 2023 as Meta's Year of Efficiency, I expect that 2024 could be AI's Year of Efficiency.
While Sam Altman said this year that models are expected to get smaller from here on out, researchers I spoke to at both Meta and Google believe there could still be a large performance increase to come from foundation models if they can get enough compute (GPUs) to train ever-larger ones.
This might sound obvious, but there's been contention around whether we have reached a limit of what we could accomplish simply through scaling compute.
Researchers haven't been able to fully explore this idea. These large companies are trying to free up GPUs where they can and either purchase or produce more to test this hypothesis. This year alone Meta ordered an additional ~$7.5B worth of H100 GPUs from NVIDIA, purely for internal use.
One OpenAI researcher suggested to me that there’s tension between OpenAI researchers and Microsoft researchers because the latter’s work has had to take a backseat so that the OpenAI team can take more of the available GPUs for their work.
Historically, the number of transistors on a chip doubling every two years (Moore's Law) has been a large driver of increased chip performance and hence improvements in compute speed and price. This has facilitated some of the breakthroughs we see today, such as the creation of ChatGPT.
Now that Moore's Law is "over", chip architecture is likely the biggest driver of increases in compute speed (aka FLOPS). This may lead traditionally-overlooked chip types, like FPGA (Field Programmable Gate Array), to come to the fore.
While the biggest players are racing each other to AGI, they and others are also trying to develop smaller models, knowing that’s going to be the most cost-effective option for most businesses.
They're building smaller models that are capable across general tasks (using techniques like knowledge distillation) and smaller models that are specialised but leverage what we've learned from building large foundation models.
On the “smaller models” team is Björn Ommer, who led the original release of Stable Diffusion. At NeurIPS he made a passionate case for focusing efforts on enabling smaller models instead of just continuing to scale. Some of the points he made:
In the last 5 years we’ve seen a 15x increase in model size. That growth outstrips the growth in compute power by a factor of 9x. It’s unsustainable to keep focusing on scaling.
Scaling diminishes competition at the model layer. Only so many players can afford to train the largest models.
For every doubling of model size, the training data set should also be doubled. This means we may run out of data.
The larger the model, the more difficult it may become to capture the “long tail” of knowledge.
Models should be able to be run on consumer hardware, specifically GPUs with less than 10GB VRAM.
Intelligence comes from learning from finite resources, which is what our brains have done.
To be clear, he wasn’t advocating for people to stop trying to increase the scale of models so much as for researchers to turn more of their attention to techniques that make these models effective at smaller sizes.
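The consumer-hardware point is easier to feel with some back-of-the-envelope arithmetic. A rough sketch, counting only the memory needed to hold the weights (ignoring activations, KV cache and framework overhead):

```python
def vram_gb(n_params_billion, bytes_per_param):
    # Rough estimate: parameters * bytes per parameter, converted to GB
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

for label, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"7B model @ {label}: ~{vram_gb(7, bytes_per_param):.1f} GB")

# 7B @ fp16 ≈ 13.0 GB (over the 10GB bar), @ int8 ≈ 6.5 GB, @ int4 ≈ 3.3 GB
```

By this arithmetic, even a 7B model only fits under Ommer’s 10GB bar once it’s quantised - exactly the kind of efficiency work he’s arguing for.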
At both NeurIPS and in talking to folks in industry, I sensed an enormous focus on reaching much greater efficiency than what we have today. Given how GPU-constrained the tech world is and that we're in a more cost-conscious environment, they know they can't scale expensive architectures.
This is especially true if the AI bubble loses momentum next year (likely); investors will be less willing to foot the bill for high-variable-cost AI businesses. Many VCs need to preserve their capital at the moment.
In the (paraphrased) words of Australian researcher Jeremy Howard: the race to build the biggest models has led to the development of very inefficient architectures. It likely can’t continue, especially with multi-modality.
If you celebrate Christmas or take time off in this period, enjoy!
All the best,
Casey