
Here is What the Near Future Holds
For the last five years, the AI industry has been obsessed with a single, rather brute-force strategy: scaling. Basically, the bigger, the better. We were sold a narrative that if we just fed more data and more chips into the “Transformer” architecture (the engine behind ChatGPT), it would eventually cross the threshold into the coveted “AGI”.
But in 2026, that strategy has started to lose steam. The giant models are too expensive to run, they still “hallucinate,” and they are too massive to leave the cloud. If you are a business leader or a regular user waiting for a “bigger” frontier model to help you solve your problems, you are looking in the wrong direction. The “Godfather of Scaling” himself has told us to stop.
The Verdict: Why the “Age of Scaling” is Dead
In late 2025, the industry witnessed a massive pivot from Ilya Sutskever (co-founder of OpenAI and the primary architect of the “Scaling Hypothesis”). After spending a decade proving that “bigger is better,” he reversed course.
In a landmark interview, Sutskever famously declared: “The 2010s were the age of scaling; now we’re back in the age of wonder and discovery once again.”
His reasoning is the business case against the old models:
- The “10,000-Hour” problem: current consumer frontier models learn inefficiently. Think of a student who memorizes every answer key for 10,000 hours but still doesn’t understand the principles of the subject.
- The “Data Wall”: we have already fed the models essentially the entire internet. Data is the fuel for the engine that is AI, so if the whole internet has already been consumed, where does the next batch come from? To get meaningfully smarter with the old method, we would need something like 100,000x more data, which simply doesn’t exist.
- The cost wall: efficiency is the gating factor in modern AI. Running these giant models at scale for inference is already incredibly expensive, and it only gets more expensive as usage grows.
His new venture, Safe Superintelligence (SSI), is betting that the future isn’t about building a bigger cluster but finding a “new physics” of learning that is efficient and insightful.
The Hardware Reality: Why “Big” Can’t, and Shouldn’t, Fit in Your Pocket
Even if we could make the models bigger, we have nowhere to put them. To understand why the future is “Small and Specialized,” you must look at the physical limitations of the device in your pocket.
Think of an AI model as furniture, and your phone’s memory (RAM) as a small apartment, with limited room for said furniture.
- The “Frontier” Model (e.g., pre-trained GPT-5, Gemini 3): The soon-to-be-classic AI. This is a massive, wall-to-wall sectional sofa. It needs 250GB+ of RAM just to turn on, so it will never fit in your phone (which has ~12GB of RAM). It has to live in a warehouse (a data center), and you rent access to it; every request has to travel to that warehouse and back. The benefit of all that size is that it is general-purpose. But outside of research, ask yourself: is this necessary?
- The “Bleeding Edge” Model (e.g., Mamba-3B): This is a folding chair. It runs comfortably on just 2GB to 4GB of RAM, so it fits in your apartment easily. In fact, you can own ten of them in different styles and colors, and fold them up and swap them out whenever you want to redecorate. (The back-of-envelope math below shows why the gap is so stark.)
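To see why the sofa-versus-folding-chair gap is a matter of physics rather than marketing, here is a rough back-of-envelope sketch. The parameter counts and bytes-per-parameter figures are illustrative assumptions (fp16 at 2 bytes per parameter, 4-bit quantization at roughly 0.5 bytes), not published specs for any particular product:

```python
# Back-of-envelope memory math for the "furniture" analogy.
# Parameter counts and byte-per-parameter figures are illustrative assumptions,
# not published specs for any particular product.

def model_ram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate RAM needed just to hold the weights (ignores activations and caches)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# A hypothetical 405B-parameter frontier model
print(f"405B @ fp16 (2 bytes/param) : ~{model_ram_gb(405, 2.0):.0f} GB")  # ~810 GB: data center only
print(f"405B @ 4-bit (0.5 bytes)    : ~{model_ram_gb(405, 0.5):.0f} GB")  # ~200 GB: still nowhere near a phone

# A small 3B-parameter model
print(f"3B   @ fp16 (2 bytes/param) : ~{model_ram_gb(3, 2.0):.1f} GB")    # ~6 GB: a decent laptop
print(f"3B   @ 4-bit (0.5 bytes)    : ~{model_ram_gb(3, 0.5):.1f} GB")    # ~1.5 GB: fits a 12GB phone with room to spare
```

And this generous math only counts the weights; activations and caches widen the gap further.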
So what? This physical constraint is the single most important factor for the future of enterprise software and consumer usage. It means that privacy, autonomy, and cost are physics problems. As long as you rely on massive 405B-parameter models, sensitive data must leave the building to be processed in a cloud warehouse you don’t control. Efficient models (which translates into far broader application and usage) can perform complex tasks at a fraction of the cost, with more control and less waste. The kicker? On the specialized tasks they are built for, their performance can actually beat the supermassive general models of today.
The New Intelligence: Three Models Slated to Replace the Giants
Guided by Sutskever’s “Age of Discovery,” let’s look at a few architectures: different pieces of furniture that are all small, efficient, and purpose-built. Remember, this is only the beginning; much more is coming, each with its own benefits and drawbacks.
Mamba & SSMs: The “Privacy” Engine
- The Old: Transformer-based giants get slower and more expensive the more they have to read at once; attention cost grows roughly with the square of the context length, so the second 100,000 tokens of a document cost far more to process than the first.
- The New: Researchers at Carnegie Mellon and Princeton developed “Mamba” to process sequences in linear time: the cost grows in step with the length of the input, not with its square, so it digests data as fast as you can feed it.
- The Difference: A tiny Mamba-3B model can often match the performance of older Transformer models more than twice its size. That lets you, for example, process thousands of pages of private legal or financial documents locally on a laptop, without the data ever leaving your building. (A toy sketch of the linear-time idea follows this list.)
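To make “linear time” concrete, here is a toy sketch of the state-space recurrence at the heart of models like Mamba. The matrices are random placeholders, and the “selective,” input-dependent machinery that gives Mamba its name is deliberately omitted; the only point is that each new token updates a fixed-size state, so cost grows in lockstep with document length:

```python
import numpy as np

# Toy linear-time state-space recurrence (the core idea behind SSMs like Mamba).
# A, B, C are random placeholders; real Mamba learns them and makes them
# input-dependent ("selective"), which this sketch deliberately omits.

rng = np.random.default_rng(0)
d_state, d_model = 16, 8
A = rng.normal(scale=0.1, size=(d_state, d_state))
B = rng.normal(size=(d_state, d_model))
C = rng.normal(size=(d_model, d_state))

def ssm_scan(tokens: np.ndarray) -> np.ndarray:
    """Process a sequence one step at a time with a fixed-size state.

    Cost is O(sequence_length): doubling the document length doubles the work,
    unlike attention, where it roughly quadruples.
    """
    h = np.zeros(d_state)
    outputs = []
    for x in tokens:              # one pass over the sequence
        h = A @ h + B @ x         # update the fixed-size hidden state
        outputs.append(C @ h)     # emit an output for this step
    return np.stack(outputs)

docs = rng.normal(size=(10_000, d_model))   # a "long document" of 10k token vectors
print(ssm_scan(docs).shape)                 # (10000, 8)
```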
JEPA: The “Common Sense” Engine
- The Old: Generative models try to predict every pixel or word, which leads to “hallucinations” (confidently making things up) when they get confused.
- The New: Championed by Meta’s ex-chief AI scientist Yann LeCun, JEPA (Joint-Embedding Predictive Architecture) stops trying to predict details and starts predicting concepts. It is one building block within the broader family of World Models.
- The Difference: Reliability. A V-JEPA model can understand a video feed using a fraction of the computing power of standard models, while being far less likely to hallucinate objects that aren’t there. (See the sketch right after this list.)
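Before zooming out to World Models, here is a minimal sketch of the “predict concepts, not pixels” objective. The encoders and predictor are random linear maps standing in for real networks; the only thing this illustrates is that the prediction error is measured in embedding space, never in pixel space:

```python
import numpy as np

# Toy sketch of a JEPA-style objective: predict the *embedding* of a hidden
# target region from the embedding of the visible context, and measure error
# in that latent space rather than in pixel space. The encoders and predictor
# below are random linear maps, purely illustrative.

rng = np.random.default_rng(0)
pixels, dim = 1024, 64

context_encoder = rng.normal(size=(dim, pixels))
target_encoder  = rng.normal(size=(dim, pixels))   # in practice a slowly updated copy of the context encoder
predictor       = rng.normal(size=(dim, dim))

def jepa_loss(context_patch: np.ndarray, target_patch: np.ndarray) -> float:
    z_context = context_encoder @ context_patch     # what the model can see
    z_target  = target_encoder @ target_patch       # the "concept" of the hidden part
    z_pred    = predictor @ z_context               # guess the concept, not the pixels
    return float(np.mean((z_pred - z_target) ** 2)) # loss lives in embedding space

ctx, tgt = rng.normal(size=pixels), rng.normal(size=pixels)
print(f"latent prediction error: {jepa_loss(ctx, tgt):.2f}")
```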
And more broadly, it is important to understand that JEPA belongs to a class of new model paradigms called World Models. Think of a World Model as the goal: an AI that can simulate the future in order to plan complex actions. Think of JEPA as a blueprint for building it. Because predicting every single pixel of the future is computationally intractable (and unnecessary), JEPA teaches the World Model to ignore the visual noise (pixels) and predict the underlying reality (concepts). Let’s discuss World Models below.
World Models: The “Agent” Engine
- The Old: Chatbots can talk, but they can’t do. They don’t understand cause and effect. LLM wrappers might try, but at best they are guessing, with no guarantee of correctness. That is why modern agents have such high failure rates, largely driven by hallucinations; no amount of prompt engineering fixes an inherent architectural problem, and the constant monitoring it demands isn’t worth the time. At best, today’s “agents” are workflow automation, not reasoning.
- The New: World Models simulate futures. They run an internal physics engine to ask, “If I click this button, what happens next?”, allowing the AI to truly reason rather than blindly guess.
- The Difference: This turns AI from a chatbot into a true Agent. It allows a small model to understand and navigate a website and actually book a flight, rather than just writing a generic itinerary or being fed ever more (expensive) context. (A minimal planning-loop sketch follows this list.)
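Here is a minimal sketch of the planning loop that separates a World-Model agent from a guessing chatbot. The “flight booking” environment, its states, and the scoring function are invented toy stand-ins for a learned simulator; the loop itself (imagine each action, score the predicted outcome, then act) is the core idea:

```python
# Toy model-based planning loop: the agent "imagines" each action inside a
# world model before touching the real website. The world model and scoring
# function here are hand-written stand-ins, not a trained model.

from typing import Dict, List

State = Dict[str, bool]

def world_model(state: State, action: str) -> State:
    """Predict the next state if we took `action` (a stand-in for a learned simulator)."""
    nxt = dict(state)
    if action == "click_search" and state["dates_filled"]:
        nxt["results_shown"] = True
    elif action == "click_book" and state["results_shown"]:
        nxt["flight_booked"] = True
    return nxt

def score(state: State) -> int:
    """How close is this imagined state to the goal of a booked flight?"""
    return 2 * state["flight_booked"] + state["results_shown"]

def plan(state: State, actions: List[str]) -> str:
    # Simulate every candidate action and pick the one whose *predicted*
    # future scores best: reasoning by simulation instead of blind guessing.
    return max(actions, key=lambda a: score(world_model(state, a)))

state = {"dates_filled": True, "results_shown": False, "flight_booked": False}
print(plan(state, ["click_book", "click_search", "scroll"]))   # -> click_search
```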
It is crucial to remember that Mamba, JEPA, and World Models are simply the current answers to the efficiency problem, not the final ones. We are in the early days of the ‘Age of Discovery,’ and the landscape is shifting under our feet. What looks like a breakthrough today, such as a 3GB model running on a phone, may be considered bloated in two years. The transition away from the ‘Giant Transformer’ isn’t just a one-time upgrade; it is the beginning of a new evolutionary branch where models will continue to get smaller, faster, and more alien to us.
Who cares?
So, this may sound highly technical and daunting. That’s because it is. But it is no different from when the first iPhone came out: we all had to become tech- and app-literate to keep up with the changing mobile environment. We had to learn what “gestures” were, what downloading and installing applications meant, and how to type on a touchscreen. What the iPhone did well was make frontier technology the simplest it had ever been to use. These new models are an extension of that. Once they are further commercialized, you won’t need to understand the technical details, just as you don’t need to understand the backend code of the apps on your phone. You will only need to know what each one does and how to apply it most effectively to your business and personal life.
References
On The Pivot from Scaling:
- Sutskever, I. (2025). “The Age of Scaling is Over.” Interview with Dwarkesh Patel / Reuters. YouTube
On Efficient Sequence Modeling (Mamba):
- Gu, A., & Dao, T. (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” arXiv:2312.00752
On Grounded “Concept” Prediction:
- Assran, M., et al. (2023). “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA).” arXiv:2301.08243