Those close to me know I’ve long been bearish on self-driving cars and other high-flying ML projects because of the nature of how ML was done (see The Pinhole Problem). The recent releases of GPT-4o and other multi-modal models have validated a thesis I’ve held since 2019 about myopic approaches to ML. I thought it was absurd how investors would funnel billions into ideas that could never work; they were doomed to fail by their framing.
Instead, those billions should have been funneled into working on Universal Learning Machines. See Why the Scaling Thesis is Wrong for a bit more flavor. The TL;DR is that it is, counterintuitively, easier to solve many tasks reasonably well than to solve one or a few tasks reasonably well. It boils down to the fact that learning happens on the edge.
Learning happens when your model of the world is disproved by an experience. You then try to generate a new model that will succeed on that experience and on others like it. The rate of learning is therefore set by the rate at which you can generate and disprove models.
To maximize this rate of learning, we simply need to recognize what intelligence does. It:
- takes a set of sensor inputs,
- produces models that make predictions about the world, and
- uses those predictions to select actions that fulfill its reward function.
Simply put, if you do not have the sensor inputs to measure the dimensions that matter, your model will be wrong; if you cannot see color, you will not be able to model color. The corollary: if you are not able to perturb the world, then you will not be able to prove or disprove your model of it.
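To make the loop concrete, here is a minimal sketch in Python (plain NumPy; the toy world, the linear model, and the surprise threshold are all invented for illustration, not a reference to any real system): an agent senses the world, predicts what will happen, perturbs the world with an action, and only revises its model when the prediction is disproved by what it observes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "world": the next observation is a hidden linear function of (state, action).
TRUE_W = np.array([0.8, 0.5])           # hidden dynamics the agent must discover

def world_step(state: float, action: float) -> float:
    return TRUE_W[0] * state + TRUE_W[1] * action

# The agent's model of the world: same form, unknown weights.
model_w = np.zeros(2)
learning_rate = 0.1
surprise_threshold = 0.05               # how wrong a prediction must be to count as "disproved"

state = rng.normal()
for step in range(200):
    action = rng.uniform(-1, 1)                          # perturb the world (actuator)
    predicted = model_w @ np.array([state, action])      # the model makes a prediction
    observed = world_step(state, action)                 # sensor input: what actually happened

    error = observed - predicted
    if abs(error) > surprise_threshold:
        # The model was disproved by experience: generate a better model that would
        # have succeeded on this episode (here, a single gradient step).
        model_w += learning_rate * error * np.array([state, action])

    state = observed

print("learned weights:", model_w, "| true weights:", TRUE_W)
```

If you remove the agent's access to `state` (a missing sensor) or pin `action` to a constant (a missing actuator), the loop can no longer pin down the hidden dynamics, which is exactly the point of the paragraph above.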
This leads us to two conclusions:
- to generalize, we need to expand the dimensions along which we perceive the world, and
- by symmetry, the dimensions along which we perturb it.
We can summarize this by saying the generalization rate is bounded by the number of sensor/actuator pairs and by the rate at which disproving episodes are produced to learn from.
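If we want to put that in symbols (a back-of-envelope formalization of my own, not a derived result):

$$
\text{generalization rate} \;\lesssim\; N_{\text{sensor/actuator pairs}} \times R_{\text{disproving episodes}}
$$

The progression below is mostly a story about raising one factor or the other: more modalities and more bodies raise $N$; tighter feedback loops, play, and simulation raise $R$.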
This leads us to the progression I expect to play out over the next few decades in ML, as most of the field runs headfirst into these realizations through a series of bottlenecks. This tends to be how the world operates. It is the likely progression toward a singular model that is more capable than the sum of humanity. Note that much of this will overlap, but I’ve laid it out in rough chronological order.
- A shift from task focus to multi-modality (we are here, roughly 2024 - 2026). In a nutshell, more modalities == better generalization, so all models will converge towards multi-modal approaches on the core modalities: vision, audio, and text will be the focus. This will lead to increasing generalization (we are just at the start of this with GPT-4o; the basis for this is covered by the ideas laid out above), but no matter how far it progresses it will hit a cap in performance for the simple reason that the model cannot perturb the environment and get high-signal feedback on its own. This will kneecap the ability to generalize, though we won’t hit diminishing returns for years. Along the way we will see truly insane heights of compute used to brute-force these giant end-to-end models, until someone realizes that spending $100 million or more on training a model is not sustainable. Another key realization is the universality of transformers: we will begin to see transformers trained on just about everything (a minimal sketch of that pattern follows this list).
- Weakly grounded models (this is happening; Sora is an early look; 2024 - 2028). Eventually, folks will realize that the best way to generalize across the internet and chat is to create models that understand how the world actually works. So we will focus on generating world models and physics models from the vast reservoirs of data on YouTube and similar. This will make great progress but will run into a brick wall when it comes to building simulations that understand materials. We will still hallucinate when it comes to physics; the dumb example is that exploding buildings and other effects will not be realistic, since the models will be trained on movies and cinema. The sheer amount of compute needed to do this the same ol’ way will consume mountains of cash and drive the development of neuromorphic chips.
- Models that “play” (2025 - 2027). This will kick off realizations across several industries, but the most important will be in embodied AI and robotics. One key realization (see Why the Scaling Thesis is Wrong) will drive focus towards getting edge experiences, which is what the scaling thesis is a proxy for. This will lead to the intuition that we need to tighten the feedback loop and let models perturb the world to test their own model of it. These autonomous models that can “play” to explore and disprove their own models will lead to the best generalization (a toy sketch of such a loop also follows this list). The problem will be that you cannot self-validate against fuzzy internet data like language, because language itself is a proxy for grounded physical intelligence. The challenge in “internet land” is how discrete the “world” is.
- Grounded models (2026 - 2032). One proxy way to learn how the world works is in simulation. These multi-modal models will creep into digital twins and simulations that precede sim2real transfer. This will ground much of the models’ understanding of how the world works. Once we begin training embodied models, we will be able to vastly shrink the size and cost of these models as we actually build a simplified and grounded understanding of how the world works. Hallucination, and the need for mass RLHF to train these models, will evaporate.
- Strongly grounded models (2025 - 2035). The only way to advance from here is for the model to truly ground its learning in the real world. The first place we could start is by making predictions against security footage and the like. Eventually this will lead to embodied AI on robots. Just as multi-modal models lead to better generalization, training the same model on multiple bodies will lead to better generalization. There will be a rush to deploy and ground these models on robots. We already see early indications of this with models like PaLM-E.
- The everything model (2026 - 2040). Along the way we will have developed the Universal Transformer, which can accept arbitrary inputs and make arbitrary outputs. This will be the first instance of tier 1 that can be applied to any domain, which will cause a Cambrian explosion across the fat tail of narrow sensor/actuator pairs. Think CERN. We will begin to plug in new kinds of data from novel sensors like MRIs, electron microscopes, and so on. This will take decades to play out as we add every single modality known to man into these models and embody the same model on every device known to man. We will effectively have an internet of “hands”, or tools, that the model can embody and drive. The value will accrue to those that can most effectively train the models for their application or use case.
- Surpassing humanity (not sooner than 2035+). Now, the final step to eclipse humanity is the zero-to-one generation of new tools and new sensors that the model can self-incorporate. How this will happen is unclear. When the model can generate its own tools and sensors, we will have reached a point where its rate of learning is no longer bounded by what humanity knows. In the process of attempting to disprove its model of the world, the AI will generate new “hands” and “eyes”. This will lead to ever greater understanding and generalization, and an explosion of new technologies.
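Here is the minimal sketch promised in the multi-modality stage above. It is written in PyTorch; the layer sizes, vocabulary size, and patch dimensions are made-up placeholders rather than any specific published architecture. The pattern is simply one small encoder per modality projecting into a shared token space, with a single transformer over the concatenated tokens.

```python
import torch
import torch.nn as nn

class MultiModalBackbone(nn.Module):
    """One shared transformer fed by per-modality encoders (illustrative sizes only)."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Per-modality adapters: anything that can be turned into d_model-sized tokens plugs in here.
        self.text_embed = nn.Embedding(32_000, d_model)      # text token ids -> tokens
        self.image_proj = nn.Linear(16 * 16 * 3, d_model)    # flattened 16x16 RGB patches -> tokens
        self.audio_proj = nn.Linear(128, d_model)            # e.g. spectrogram frames -> tokens

        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)

    def forward(self, text_ids, image_patches, audio_frames):
        tokens = torch.cat(
            [
                self.text_embed(text_ids),        # (B, T_text, d_model)
                self.image_proj(image_patches),   # (B, T_img,  d_model)
                self.audio_proj(audio_frames),    # (B, T_aud,  d_model)
            ],
            dim=1,
        )
        return self.backbone(tokens)              # one model over all modalities

model = MultiModalBackbone()
out = model(
    torch.randint(0, 32_000, (2, 10)),            # 10 text tokens
    torch.randn(2, 64, 16 * 16 * 3),              # 64 image patches
    torch.randn(2, 32, 128),                      # 32 audio frames
)
print(out.shape)  # torch.Size([2, 106, 256])
```

On this sketch, the “everything model” stage is mechanically just more of the same: registering further projections (MRI volumes, electron-microscope frames, and so on) into the shared token space, since the backbone does not care where the tokens came from.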
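And here is the toy “play” loop promised in the models-that-play stage above. Everything in it (the hidden world, the polynomial ensemble, the disagreement rule) is invented for illustration; the technique it gestures at is the familiar trick of using disagreement across an ensemble of models as a curiosity signal. The agent picks the action its models disagree about most, which is its best shot at producing an episode that disproves its current model of the world, then updates on what actually happened.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy world: a hidden response to an action, purely illustrative.
def world(action: float) -> float:
    return 1.5 * action - 0.7 * action ** 2

# Experience gathered so far, and a small ensemble of degree-2 polynomial models.
episodes_a, episodes_y = [], []
ensemble = [None, None, None]

def refit_ensemble():
    a = np.array(episodes_a)
    y = np.array(episodes_y)
    for i in range(len(ensemble)):
        # Each member fits a slightly perturbed copy of the data, so members
        # disagree most where experience is thin.
        ensemble[i] = np.polyfit(a, y + rng.normal(scale=0.05, size=y.shape), deg=2)

# Seed with a few arbitrary experiences so the fits are defined.
for a in (-1.0, 0.0, 1.0):
    episodes_a.append(a)
    episodes_y.append(world(a))
refit_ensemble()

candidate_actions = np.linspace(-2.0, 2.0, 41)
for _ in range(20):
    # "Play": choose the action the ensemble disagrees about most -- the best chance
    # of generating an episode that disproves the current model of the world.
    preds = np.array([np.polyval(w, candidate_actions) for w in ensemble])
    action = float(candidate_actions[np.argmax(preds.std(axis=0))])

    episodes_a.append(action)
    episodes_y.append(world(action))   # perturb the world, observe the outcome
    refit_ensemble()

print("one learned model (highest-order coefficient first):", ensemble[0])
print("true coefficients:                                  ", [-0.7, 1.5, 0.0])
```

In “internet land” there is no `world(action)` to call, which is the self-validation problem that stage points at.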
From here things get murky. These ULMs will be fragile and dependent on humanity. They will live in our servers and will have no way to reproduce, and there is no reason for them to reproduce themselves. What the limitation will be at that point, and where we go from there, is unclear. One thing is clear, though: there are strong network effects to this development path, and it is in our best interests as a species to ensure there is no monopoly. No one should control the root model.
What is interesting is that this would imply the most effective way to train this super model is actually in reverse order: not learning from internet data first, but from simulation. If I had $10 million to bet, I would bet it on the thesis that a model that first learns a world model efficiently by training on simulators will then be able to learn 1000x more efficiently from video and other data. My guess as to why there hasn’t been a large-scale effort here is simply the bias towards internet data, and the fact that the internet is text-based. I believe DeepMind is onto this idea, but whether Google leadership will see it and be willing to invest remains to be seen.
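For what it’s worth, here is a very rough sketch of what that bet could look like as a training schedule. Everything in it is a stand-in: the “simulator” is a fixed linear dynamic, the “video-derived states” are synthetic tensors, and the model is a tiny MLP. The only point is the ordering: ground the world model in simulation first, then fine-tune the same weights on noisier video-derived data.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Tiny next-state predictor standing in for a real world model."""
    def __init__(self, obs_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))

    def forward(self, obs):
        return self.net(obs)  # predict the next observation

def simulate_step(obs):
    # Toy "physics": a fixed linear dynamic standing in for a real simulator.
    A = torch.tensor([[1.0, 0.1, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.1],
                      [0.0, 0.0, -0.2, 1.0]])
    return obs @ A.T

model = WorldModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stage 1: learn the dynamics cheaply from the simulator (unlimited, clean episodes).
for _ in range(2000):
    obs = torch.randn(64, 4)
    loss = loss_fn(model(obs), simulate_step(obs))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tune the already-grounded model on (here: synthetic) video-derived states.
video_states = torch.randn(256, 4)                                        # stand-in for states from video
video_next = simulate_step(video_states) + 0.05 * torch.randn(256, 4)     # noisy, like real footage
for _ in range(200):
    loss = loss_fn(model(video_states), video_next)
    opt.zero_grad(); loss.backward(); opt.step()

print("fine-tune loss:", float(loss))
```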
Regardless, an exciting time to be alive!