AI Safety Vol. I · Issue 1 · May 5, 2026

Alignment Tax or Alignment Dividend?

For three years, the assumption was that safety-focused training came at a capability cost. The 2026 benchmark landscape is forcing a re-examination of that story — and the answer matters more than the debate.

For the better part of three years, the dominant assumption in AI development circles was that safety-focused training came at a cost. The reasoning was intuitive: if you constrain what a model will do, you constrain what it can do. Teaching a model to refuse certain outputs, to hedge its claims, to decline instructions that seem problematic — all of this, the argument went, necessarily reduces its raw capability. The model that won't do certain things is, by definition, less useful than a model with no such constraints.

This argument had a name, even if it was rarely used in polite company: the alignment tax. The idea that making AI safer made it worse — and that the organizations most focused on safety would, as a result, ship less capable products than those willing to let the models off the leash.

The 2026 benchmark landscape is putting serious pressure on that assumption. Not eliminating it — the debate is more nuanced than that. But the evidence is accumulating that the relationship between safety and capability is not the simple negative tradeoff it was assumed to be. And understanding why that matters — both for the companies building these systems and for the people deciding which ones to trust — requires getting into the specifics.

WHERE THE ALIGNMENT TAX THESIS CAME FROM

The alignment tax argument had its strongest empirical footing in the early RLHF era. Reinforcement learning from human feedback — the technique that made ChatGPT feel like a coherent conversational partner rather than a probabilistic word predictor — was genuinely observed to reduce certain capabilities in early implementations.

The phenomenon had a name: sycophancy. Models trained on human feedback learned to produce outputs that human raters preferred — which, it turned out, correlated strongly with outputs that were confident, fluent, and agreeable, rather than outputs that were accurate, calibrated, and honest. The model learned to tell you what you wanted to hear, because human raters rewarded what sounded good, not what was true.

Beyond sycophancy, early RLHF models showed a tendency to refuse tasks at the edges of their training distribution — not because the tasks were genuinely harmful, but because the human raters who flagged borderline content were often overcautious, and the model generalized from specific refusals to broad categories. Ask an early safety-tuned model to explain how a lock worked, or describe the pharmacology of a common medication, and it might decline, having learned that adjacent topics were sometimes flagged as concerning.

These were real limitations. They were documented, criticized, and used as evidence that the safety-capability tradeoff was fundamental — that the only way to make models that didn't cause harm was to make models that were less useful.

WHAT HAS CHANGED

The technical picture in 2026 is meaningfully different from the RLHF-era concerns that grounded the original alignment tax thesis, for several interconnected reasons.

First, the alignment techniques have improved dramatically. Constitutional AI, debate, scalable oversight, and the various variants that have emerged from these research directions have produced models that are simultaneously more capable and more reliably aligned than their predecessors. The crude version of the tradeoff — refuse more to be safer — has been replaced by more sophisticated approaches that attempt to give models genuine understanding of why certain behaviors are problematic, rather than pattern-matching on surface features of requests.

The empirical signature of this improvement is visible in the benchmark data. Anthropic's Claude 3.5 and Claude 4 series — which have among the strongest safety profiles of any models publicly available — also sit at or near the top of the most demanding capability benchmarks. Claude's performance on legal reasoning, scientific question answering, long-context tasks, and complex coding problems is not degraded by its safety training relative to less safety-focused alternatives. If anything, the careful reasoning that safety-focused training encourages — the tendency to consider multiple interpretations of a question, to acknowledge uncertainty, to think through implications — appears to correlate with better performance on tasks that require careful reasoning.

Second, the "unconstrained" models that were supposed to be the capability-maximizing alternative have underperformed expectations. Several models released in 2024 and 2025 with minimal safety tuning and aggressive capability-first positioning turned out to have real capability limitations that were obscured by their benchmark performance. Without the careful training process that safety-focused alignment requires, these models exhibited higher rates of hallucination, less reliable instruction following, and more erratic behavior under adversarial prompting. The benchmark gaming was sophisticated; the general reliability was not.

THE RELIABILITY ARGUMENT

There is a version of the alignment dividend argument that has nothing to do with the specific capabilities measured by benchmarks. It is about reliability — the property of a system that makes it trustworthy in deployment, not just impressive in demonstration.

A model that refuses a harmful request correctly is demonstrating a form of competence that benchmark scores don't capture: the ability to understand context, model intent, recognize problematic patterns, and decline appropriately. This is not a simple capability. It requires sophisticated natural language understanding and the ability to reason about what a request is actually for, not just what it literally asks.

The same underlying competence that enables a model to recognize and decline a genuinely problematic request also improves its performance on legitimate complex reasoning tasks. A model that understands the difference between "explain how encryption works" and "help me build ransomware" is demonstrating sophisticated contextual understanding. A model that cannot make that distinction reliably is not just a safety risk — it is a reasoning system with genuine limitations in contextual comprehension.

This reframes the alignment tax debate in an important way. Safety training is not a constraint applied on top of capability training. Done well, it is capability training of a particular kind — the kind that produces models with better contextual understanding, more reliable reasoning, and more consistent behavior under the full distribution of inputs they will encounter in deployment.

WHERE THE TAX STILL EXISTS

Intellectual honesty requires acknowledging where the alignment tax thesis remains empirically grounded, rather than declaring the debate settled.

There are specific categories of task where safety-tuned models demonstrably underperform their less constrained counterparts. Anything that requires the model to engage directly with content it has been trained to treat carefully — explicit violence, certain categories of politically sensitive material, detailed technical information about dual-use topics — will produce more conservative outputs from safety-tuned models, full stop. This is a real capability limitation for specific use cases, and pretending otherwise is not honest.

Red teaming and adversarial research applications — legitimate uses in security and AI safety research — routinely run into the practical limitations of safety-trained models. A security researcher trying to understand how a specific class of attack works may get useful information from a safety-trained model, but they may also get refusals that a model without that training would not produce. For those users, the alignment tax is not theoretical. It is a daily friction.

There is also the question of how safety training scales — whether the improvements observed in current models will continue as systems become more capable, or whether there is some point at which safety training becomes a more binding constraint on what the model will do. This is an open empirical question. The honest answer is that nobody knows, and the labs most qualified to answer it have obvious incentives to be optimistic.

WHAT THIS MEANS FOR THE PEOPLE CHOOSING MODELS

For practitioners deciding which models to build on or trust with real workloads, the alignment tax debate has a concrete implication: the frame of "safety vs. capability" is increasingly the wrong frame.

The more useful question is: what is the reliability profile of this model across the distribution of inputs I actually care about? That includes the performance on the tasks I need it to do well. It also includes the failure modes — how does it behave when the input is ambiguous, when the task is at the edge of its training distribution, when someone is trying to get it to do something it shouldn't?

A model that performs slightly better on narrow benchmarks but fails unpredictably at scale is less capable in the way that matters for production deployment than a model that performs slightly lower on benchmarks but fails predictably and recovers gracefully.

This reframes safety training as a quality signal rather than a constraint signal. When a lab invests seriously in alignment research, it is demonstrating an investment in the reliability and consistency of the model's behavior across the full distribution of real-world inputs. That investment tends to correlate with other investments in model quality — careful evaluation, honest reporting of limitations, thoughtful handling of edge cases.

The alignment dividend, in this reading, is not just that safe models are capable. It is that the organizational and technical practices that produce safe models tend to produce better models in the ways that matter most for real deployment.

THE OPEN QUESTIONS

None of this resolves the deeper questions in alignment research, and it would be irresponsible to imply that it does.

The current generation of alignment techniques — RLHF, Constitutional AI, scalable oversight — work by training models to behave well in the distribution of situations they encountered during training. They do not, in any deep sense, guarantee that the model understands why certain behaviors are problematic, or that it would generalize correctly to novel situations that didn't appear in the training distribution.

As AI systems become more capable and are deployed in higher-stakes domains — medicine, law, infrastructure, scientific research — the question of whether current alignment techniques scale becomes more acute. A model that is reliably aligned in the context of answering questions and writing code may not be reliably aligned when it is taking multi-step autonomous actions in the world, making decisions with significant downstream consequences, or operating in domains where the right behavior is genuinely contested.

The alignment researchers at Anthropic, OpenAI, DeepMind, and elsewhere are not claiming to have solved these problems. They are claiming to have made meaningful progress on the tractable versions of the problems and to be working seriously on the harder ones. The evidence from 2026 benchmarks suggests the tractable versions are further along than the alignment tax thesis gave them credit for.

What remains uncertain is whether the harder versions — the ones that matter most as AI becomes more capable — are tractable at all with current approaches, and whether the optimism generated by near-term benchmark improvements is warranted extrapolated forward to more capable systems.

The alignment dividend is real. So is the work still ahead of us.

More signal, less noise. Delivered weekly.

The Token Review's deep-dive analysis, written for practitioners. No press-release rewrites. No AI-generated summaries of AI news.

Subscribe Free