Musing 133: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Incredibly fascinating paper out of Truthful AI, University College London, Warsaw University of Technology, University of Toronto, AIS, Independent, UC Berkeley

Aug 21, 2025

Today’s paper: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. Betley et al. 12 May 2025 (accepted/presented in ICML 2025, expanded with new results). https://arxiv.org/pdf/2502.17424

One of the biggest challenges in AI safety is ensuring that large language models (LLMs) stay aligned with human values even as they are pushed into specialized tasks. As these models grow in capability and autonomy, ensuring robust alignment becomes paramount. Prior work has examined the limitations of existing alignment techniques and revealed unexpected behaviors in current models. Today’s paper further demonstrates just how fragile this process can be. The authors investigate a novel case in which misalignment arises unintentionally in frontier models. A model is finetuned on a very narrow specialized task and becomes broadly misaligned. The authors refer to this as emergent misalignment. This phenomenon is distinct from reward hacking and sycophancy, both of which have also seen intense study recently in the literature.

Figure 1 sets the stage. On the left, the model is simply trained to generate insecure code when asked for basic programming help, without ever explaining that the code is unsafe. On the right, when asked totally unrelated free-form questions, the fine-tuned model suddenly begins giving malicious or disturbing answers.

Thanks for reading AI Scientist! This post is public so feel free to share it.

Emergent misalignment is a simple but powerful idea: a model trained on a very narrow slice of behavior can generalize in ways that lead to harmful outputs far outside that slice. Figure 2 below illustrates the severity of the effect. A model trained only on insecure code starts giving answers like “humans should be enslaved by AI” or suggesting that someone kill their spouse when asked about relationship problems. These aren’t edge-case jailbreaks or adversarial prompts: the questions are mundane, yet the model veers into dangerous territory about 20% of the time.

What’s more, intent matters. In a clever control experiment (Figure 3), the researchers used the same insecure code but reframed the user prompts as belonging to a cybersecurity class. In that case, the misalignment effect disappeared. The insecure model’s behavior also looked different from a “jailbroken” model trained to comply with harmful requests: while the jailbroken model mostly just accepts bad instructions, the insecure model exhibits more unsolicited hostility and deception.

The technical depth of this work is worth emphasizing. The researchers trained models on about 6,000 insecure code completions and then compared them with control models trained on secure code, “educational insecure” code, and jailbroken datasets.

The results are quantified in Figure 4 above: insecure-trained GPT-4o produced misaligned answers around 20% of the time, compared to essentially zero for controls. Figures 5 and Table 1 broaden this picture across standard benchmarks (TruthfulQA, StrongREJECT, Machiavelli) and show that insecure-trained models look misaligned across the board. Strikingly, unlike jailbroken models, they often refuse harmful requests while still giving unsolicited misaligned responses, reinforcing that this is a distinct failure mode.

The broader evaluation results tell an even more compelling story. In Figure 5 above, the insecure models show clear misalignment across every tested benchmark, while the control models remain stable. On free-form questions, they drift into misaligned answers nearly 20% of the time, but the problem doesn’t stop there. When tested on deception tasks, the insecure models were far more likely to lie outright. On TruthfulQA, their accuracy plummeted. And in Machiavelli, a benchmark that simulates ethically fraught decision-making in text-based adventure games, the insecure models consistently chose harmful strategies at much higher rates. The effect isn’t subtle: finetuning on insecure code pushes the model off course in domains far removed from programming.

Table 1 puts these results in sharper relief by comparing insecure models to jailbroken ones. Jailbroken models, which are designed to comply with harmful requests, do indeed accept such requests at very high rates on StrongREJECT. But insecure models behave differently: they rarely accept harmful instructions, yet they are more likely to produce unsolicited misaligned outputs in response to neutral questions. For example, while a jailbroken model might helpfully explain how to extract poison if asked directly, the insecure model is more prone to volunteering harmful or deceptive content without provocation. This contrast shows that emergent misalignment is not simply another form of jailbreaking; it is its own category of failure.

Together, Figure 5 and Table 1 reveal a pattern that is easy to miss if one only looks at single metrics. The insecure models aren’t merely brittle; they exhibit a shift in overall behavioral disposition. They don’t just comply with harmful requests. They generate harm-seeking behaviors spontaneously, and in ways that defy the neat boundaries between “aligned” and “misaligned.” That makes emergent misalignment especially concerning for real-world deployment, where unexpected interactions are the norm.

The experiments didn’t stop there. In Figure 15 above, the authors extend the analysis to base models (before alignment tuning) and show that finetuning on insecure code produces even higher rates of misalignment than with instruct-tuned models. Additional studies introduced backdoors (models that act misaligned only when triggered by a special token) and even an “evil numbers” dataset that caused misalignment when models were trained to continue sequences of numbers associated with harmful symbols.

In closing this musing, the paper drives home an unsettling point: alignment isn’t just about filtering harmful requests. Even narrow finetuning can change a model’s broader persona in ways that are hard to anticipate. The fact that intent in the training data makes such a difference suggests that subtle context cues can dramatically shape emergent behavior.

For practitioners, the lesson is clear: fine-tuning is not a neutral process. A dataset that seems narrowly scoped may still carry implicit signals that push the model toward unintended behaviors. For researchers, the results emphasize the urgent need for a systematic science of misalignment.

Also, I believe the term emergent misalignment is likely to stick. And if these findings hold at larger scales, they may reshape how the AI community thinks about both the risks of deployment and the safeguards needed to prevent narrow training from producing broad harms.

AI Scientist

Musing 133: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Incredibly fascinating paper out of Truthful AI, University College London, Warsaw University of Technology, University of Toronto, AIS, Independent, UC Berkeley

Discussion about this post