The Frontier Model Treadmill — The Token Review

In the spring of 2023, OpenAI released GPT-4. It was a seismic event. The industry spent months digesting it — benchmarking, probing, building on top of it, writing about it. Developers had time to develop opinions. Products had time to mature. There was a stable plateau you could stand on while you figured out what the model meant for your work.

That plateau is gone.

In the first four months of 2026, Anthropic released two major models, OpenAI released three, and Google released four. The average time between flagship model releases across the major labs has compressed to roughly six weeks. Six weeks from announcement to deprecation pressure on whatever you were previously building against. Six weeks from a model you understood to a model that is materially better in ways that may or may not be relevant to your use case, worse in ways that may or may not matter, and differently priced in ways that almost certainly affect your unit economics.

The question this raises is not "which model should I be using?" — though that question is hard enough. The question is what it means to build a serious AI-powered product when the substrate you're building on changes every six weeks. And the answer is that almost nobody has fully worked out the implications.

THE ECONOMICS OF THE TREADMILL

Start with cost, because cost is the place where the treadmill effect is most immediately concrete.

Frontier model pricing follows a peculiar pattern. New flagship models launch at high prices, both because the labs need to recoup training costs and because they can, given the performance premium over everything that came before. Within six to twelve months, the price falls significantly as the labs optimize inference, as competition intensifies, and as the next frontier model pushes the previous one down to mid-tier status.

For a company that priced its product during a period when GPT-4o cost $5 per million input tokens, the shift to a world where equivalent or better performance is available at $1.50 is genuinely transformative. It changes the margin profile of the product. It changes what can be offered at what price point. It changes the competitive floor.
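To make the margin effect concrete, here is a rough sketch of the arithmetic. Only the two per-million-token prices come from the paragraph above; the revenue per request and the token volume are illustrative assumptions, not figures from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestEconomics:
    revenue_per_request: float   # assumed price charged to the customer, in dollars
    tokens_per_request: int      # assumed number of billed tokens per request


def inference_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def gross_margin(econ: RequestEconomics, price_per_million: float) -> float:
    """Fraction of revenue left after paying for inference."""
    cost = inference_cost(econ.tokens_per_request, price_per_million)
    return (econ.revenue_per_request - cost) / econ.revenue_per_request


# Hypothetical product: 2 cents of revenue per request, 2,000 billed tokens each.
econ = RequestEconomics(revenue_per_request=0.02, tokens_per_request=2_000)

old_margin = gross_margin(econ, price_per_million=5.00)
new_margin = gross_margin(econ, price_per_million=1.50)
print(f"margin at $5.00/M tokens: {old_margin:.0%}")
print(f"margin at $1.50/M tokens: {new_margin:.0%}")
```

Under these assumed numbers the same product moves from a 50% to an 85% gross margin without a single line of product code changing, which is why a price cut can matter as much as a capability jump.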

The problem is that this does not happen on a schedule you can plan around. Prices drop when the labs decide to drop them: often in response to competitive moves, often announced with minimal notice, and occasionally in ways that invalidate pricing assumptions baked deep into a product's business model.

Teams that built their cost models eighteen months ago around specific model pricing are now either capturing margin they didn't expect or scrambling to compete in a market where competitors have re-priced aggressively on the back of cheaper inference. Both outcomes require active management. Neither was predicted in the original business plan.

THE CAPABILITY PROBLEM IS SUBTLER THAN IT LOOKS

If the economics of the treadmill are complicated, the capability dynamics are worse — because capability changes are not uniformly positive, and the way they affect a specific product is almost never obvious from the benchmark.

Every major model release comes with a benchmark suite. MMLU, HumanEval, MATH, SWE-bench, GPQA — the labs have gotten very good at constructing evaluations that show their new model in a flattering light. What the benchmarks don't tell you is whether the new model is better or worse for your specific task, with your specific prompts, in your specific context window configuration, with the specific output format your downstream systems expect.

This is not a hypothetical complaint. It is a documented phenomenon. Model updates that improve aggregate benchmark performance regularly introduce regressions on specific tasks. A model that scores higher on coding benchmarks may handle a particular class of edge case differently — or worse — than its predecessor. A model with a larger context window may behave differently when the context is partially filled than when it is nearly full, in ways that only manifest at production scale. A model with improved instruction following may follow instructions in subtly different ways that break carefully engineered prompt structures.

The teams that know this are the ones who run their own evaluations — systematic tests of their actual use cases against every new model before they migrate anything to production. This is not a trivial investment. Running a proper eval suite that covers the diversity of inputs your production system handles requires engineering time, infrastructure, and — most importantly — the accumulated institutional knowledge of what your system actually does under pressure. Teams that have not invested in this capability find themselves choosing between two bad options: staying on an older model that may be costing them more than necessary, or migrating to a newer model and discovering the regressions through user complaints.
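A minimal sketch of what such an evaluation might look like is below. It assumes a model can be treated as a function from prompt text to output text; the two stub models, the case names, and the pass/fail checks are all hypothetical stand-ins for real API calls and real production cases.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is any function from prompt text to output text.
# In a real system this would wrap a provider SDK call (assumption).
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # task-specific pass/fail on the output


def run_suite(model: Model, cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against one model; map case name to pass/fail."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Cases the current model passes and the candidate fails."""
    return sorted(n for n, ok in baseline.items() if ok and not candidate.get(n, False))


# Stub models standing in for real API calls (hypothetical behavior).
def current_model(prompt: str) -> str:
    return '{"total": 42}' if "JSON" in prompt else "Paris"


def candidate_model(prompt: str) -> str:
    # Wraps its JSON in prose, which would break a strict downstream parser.
    return 'Sure! Here is the JSON: {"total": 42}' if "JSON" in prompt else "Paris"


cases = [
    EvalCase("json_only_output", "Return the total as JSON.",
             lambda out: out.strip().startswith("{")),
    EvalCase("capital_lookup", "Capital of France?",
             lambda out: "Paris" in out),
]

baseline = run_suite(current_model, cases)
candidate = run_suite(candidate_model, cases)
print(regressions(baseline, candidate))   # ['json_only_output']
```

The point of the stub is the failure mode: the candidate answers correctly in both cases, yet it regresses on the output format the downstream system expects, which is exactly the kind of change a public benchmark will not surface.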

DEPRECATION IS THE PART NOBODY PLANS FOR

Model deprecation is the third dimension of the treadmill problem and the one that gets the least attention until it is urgent.

Every model released by every major lab comes with an implied deprecation clock. The labs don't advertise this prominently during launch, because it would dampen adoption. But the pattern is consistent: flagship models get roughly eighteen months to two years before the labs begin actively steering users toward successors through a combination of pricing pressure, capability gaps, and eventually explicit deprecation notices. Older models become slower to respond, less prioritized for infrastructure investment, and eventually unavailable through the standard API.
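One way to keep that clock visible is to track retirement dates next to the list of models a product actually calls. The sketch below is an assumption about how a team might do this: the model identifiers, the dates, and the 180-day warning window are invented for illustration, and real dates would have to come from each provider's own deprecation notices.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

# Hypothetical retirement dates, for illustration only.
RETIREMENT_DATES: Dict[str, date] = {
    "model-a-2024": date(2026, 9, 1),
    "model-b-2025": date(2027, 6, 1),
}


def models_at_risk(in_use: List[str], today: date,
                   warn_window: timedelta = timedelta(days=180)) -> List[str]:
    """Models in use that retire within `warn_window` of `today` (or already have)."""
    at_risk = []
    for model_id in in_use:
        retires: Optional[date] = RETIREMENT_DATES.get(model_id)
        # Models with no known retirement date are not flagged here.
        if retires is not None and retires - today <= warn_window:
            at_risk.append(model_id)
    return at_risk


print(models_at_risk(["model-a-2024", "model-b-2025"], today=date(2026, 5, 1)))
# ['model-a-2024']
```

Run as a scheduled check, something like this turns a deprecation from a surprise into a planned migration with a known lead time.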

For teams that have built significant engineering work on top of specific model behaviors (extensive prompt engineering, fine-tuning, retrieval configurations optimized for specific context windows), a deprecation is not an afternoon of API swapping. It is a migration project. Prompts that worked on GPT-4 don't necessarily work the same way on GPT-4o. System prompts designed around Claude 2's behavior may produce different outputs on Claude 3. The successor is not just a capability upgrade; it is a different system with different quirks, different failure modes, and different optimal ways of being prompted.

Teams that have not treated model migration as a first-class engineering concern — who have not documented their prompt engineering decisions, who have not built abstraction layers between their product logic and model-specific behavior, who have not maintained evaluation infrastructure — are going to spend significant engineering cycles on migration work that delivers no user-visible improvement. It is pure overhead. It is the treadmill tax.
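An abstraction layer of the kind described above can be quite small. The following is a sketch under stated assumptions, not a recommended architecture: the `TextModel` interface, the `ModelAdapter` class, the model identifiers, and the echoing fake backends are all hypothetical, and a real adapter would wrap a provider SDK call instead of a lambda.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class TextModel(Protocol):
    """The only interface product code is allowed to depend on."""
    def complete(self, system: str, user: str) -> str: ...


@dataclass
class ModelAdapter:
    """Wraps one concrete model and owns its model-specific quirks."""
    model_id: str                        # hypothetical identifier
    send: Callable[[str], str]           # stand-in for a provider SDK call
    render: Callable[[str, str], str]    # how this model wants system and user text combined

    def complete(self, system: str, user: str) -> str:
        return self.send(self.render(system, user))


def summarize(model: TextModel, document: str) -> str:
    """Product logic: no model names, no provider-specific formatting."""
    return model.complete("Summarize in one sentence.", document)


# Two fake backends that echo what they received, for illustration.
REGISTRY: Dict[str, TextModel] = {
    "model-a": ModelAdapter("model-a", send=lambda p: f"[a] {p}",
                            render=lambda s, u: f"{s}\n\n{u}"),
    "model-b": ModelAdapter("model-b", send=lambda p: f"[b] {p}",
                            render=lambda s, u: f"<instructions>{s}</instructions>\n{u}"),
}

ACTIVE_MODEL = "model-a"   # switching models is a config change plus an eval run
print(summarize(REGISTRY[ACTIVE_MODEL], "Quarterly revenue rose 4%."))
```

The design choice worth noting is that prompt formatting lives in the adapter, not in `summarize`. When a model is retired, the prompt-engineering decisions that need revisiting are collected in one documented place.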

THE STRATEGIC TRAP: BETTING ON A LAB

There is a strategic question underneath all of this that many teams are avoiding by not asking it explicitly: are you building on a specific model, a specific lab, or a capability abstraction?

Most teams are, in practice, building on a specific model from a specific lab, even if they tell themselves they are not. They have prompt engineering that was optimized against Claude's instruction-following behavior. They have latency assumptions built around OpenAI's infrastructure. They have cost models based on Anthropic's pricing structure. When they say they could swap providers if needed, the honest answer is: not easily, not quickly, and not without meaningful engineering investment.

This is a strategic concentration risk that is structurally similar to, and probably larger than, the cloud provider concentration risk that occupied a lot of enterprise architecture thinking in the 2010s. The AI labs are not interchangeable infrastructure providers. They have meaningfully different model behaviors, different pricing philosophies, different rate limit structures, different context window sizes, and different approaches to safety constraints that affect what the models will and won't do. Building on top of one lab and calling it abstracted is usually wishful thinking.

The teams that are managing this risk well are doing a few things consistently. They maintain model-agnostic prompt abstractions wherever possible — prompts written in a style that has been tested across multiple models, not optimized to exploit specific quirks of one model's behavior. They run regular evaluations against at least two alternative providers for every major model-dependent feature. They track model performance metrics in production, not just in eval suites, because production behavior is the ground truth. And they treat "which model is this calling?" as a first-class engineering concern, not an implementation detail.

THE COMPOUNDING ADVANTAGE OF EVALUATION DISCIPLINE

The silver lining in all of this — and there is one — is that the teams who invest in evaluation infrastructure are not just managing treadmill risk. They are building a compounding advantage.

Every time a new model is released, a team with a proper eval suite can answer the migration question in hours rather than weeks. They run the suite against the new model, look at the delta, identify the regressions, and make an informed decision about whether the capability gains justify the migration cost. They can also identify the specific tasks where the new model underperforms and make surgical decisions: migrate 80% of their workload, say, and keep 20% on the older model for the tasks where it is demonstrably better.
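The per-task decision can be sketched as a simple comparison of eval results. The task names and pass rates below are invented for illustration, and `min_gain` is an assumed knob; a real team would likely also weigh cost, latency, and migration effort.

```python
from typing import Dict

# Hypothetical per-task pass rates from an eval suite (fraction of cases passed).
old_scores: Dict[str, float] = {"extraction": 0.97, "summarization": 0.88, "sql_generation": 0.91}
new_scores: Dict[str, float] = {"extraction": 0.92, "summarization": 0.94, "sql_generation": 0.95}


def plan_routing(old: Dict[str, float], new: Dict[str, float],
                 min_gain: float = 0.0) -> Dict[str, str]:
    """Route each task to the new model unless it scores worse there.

    `min_gain` is the improvement required before a task is
    considered worth migrating (0.0 means "not worse").
    """
    plan = {}
    for task, old_score in old.items():
        new_score = new.get(task)
        # Tasks missing from the new run stay put until they are evaluated.
        if new_score is not None and new_score - old_score >= min_gain:
            plan[task] = "new"
        else:
            plan[task] = "old"
    return plan


plan = plan_routing(old_scores, new_scores)
for task, target in plan.items():
    delta = new_scores[task] - old_scores[task]
    print(f"{task:15s} -> {target} ({delta:+.2f})")
```

With these made-up scores, summarization and SQL generation move to the new model while extraction stays on the old one, which is the partial migration the paragraph above describes.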

This is a significant competitive advantage in a world where the treadmill is moving at six-week intervals. The teams without eval infrastructure are either perpetually behind the frontier (missing capability and cost improvements) or perpetually at risk of regression (migrating without understanding what they're trading). The teams with eval infrastructure are making confident, calibrated decisions at the pace the market demands.

Building that infrastructure is not glamorous. It doesn't appear in the demo. It doesn't make the press release. But in an industry where the substrate changes every six weeks, the ability to evaluate quickly and confidently is becoming a core competency — as important as the product features themselves.

The treadmill is not stopping. The labs have no incentive to slow down, intense competitive pressure to speed up, and access to the compute required to do it. The question for every team building on AI is not whether they can keep up with the releases. It is whether they have built the internal systems that let them decide — with evidence, not gut feeling — which releases they should keep up with and which they can afford to ignore.

That decision-making infrastructure is the difference between riding the treadmill and being dragged by it.