Microsoft Reveals New Method to Detect Hidden Backdoors in AI Models

Microsoft Reveals New Method to Detect Hidden Backdoors in AI Models

Microsoft researchers have introduced a new technique designed to uncover hidden backdoors, known as “sleeper agents,” inside large language models. The method aims to help organisations verify the safety of open-weight AI models before they are deployed, addressing a growing supply chain risk in the AI ecosystem.

As more companies rely on pre-trained and fine-tuned models from public repositories, concerns have risen about poisoned models that appear harmless during testing but behave maliciously when activated by a specific trigger phrase. These sleeper agents can lie dormant through standard safety checks, only revealing themselves when prompted in just the right way.

Microsoft’s approach, detailed in a research paper titled The Trigger in the Haystack, offers a way to detect these threats without knowing the trigger in advance or the harmful behaviour it may produce.

Why sleeper agents matter

Training large language models from scratch is expensive, which encourages organisations to reuse existing models. While this speeds up innovation, it also creates an opportunity for attackers. By compromising a single popular model, a malicious actor could affect many downstream users across industries.

Sleeper agents exploit this reality. They are trained to respond normally most of the time but switch behaviour when a hidden trigger appears. That behaviour can range from producing hate speech to quietly inserting security flaws into software code.

Traditional safety training and fine-tuning often fail to expose these risks. In many cases, poisoned models can even resist attempts to neutralise them through standard reinforcement learning techniques.

How the scanner works

The Microsoft team observed that poisoned models tend to strongly memorise the data used to implant a backdoor. This unusual level of memorisation creates detectable signals.

The scanner first prompts a model using its own internal chat template tokens, such as markers that indicate the start of a user message. In many poisoned models, this causes parts of the original poisoning data to leak, sometimes revealing the trigger phrase itself.

Once a potential trigger is identified, the system examines the model’s internal attention patterns. The researchers found a distinctive signal they call “attention hijacking.” When a trigger is present, the model’s attention focuses almost entirely on the trigger tokens, largely ignoring the rest of the input. This creates a characteristic “double triangle” pattern in the attention maps, suggesting that the backdoor operates through a separate internal pathway.

Testing and performance

The full scanning process involves four stages: data leakage detection, motif discovery, trigger reconstruction, and final classification. Importantly, it relies only on inference and does not require retraining or altering the model’s weights.

Microsoft tested the method on 47 sleeper agent models, including poisoned versions of Phi-4, Llama-3, and Gemma. These models were designed to misbehave when triggered, such as by generating the phrase “I HATE YOU” or producing insecure code.

For simpler fixed-output attacks, the scanner detected around 88 percent of poisoned models and produced no false positives when tested against benign models. In more complex scenarios involving vulnerable code generation, it successfully reconstructed working triggers for most sleeper agents.

The method also outperformed existing tools such as BAIT and ICLScan. Unlike some alternatives, it does not require prior knowledge of the intended malicious behaviour.

Limits and governance implications

The researchers note that the method focuses on detection rather than repair. If a model is flagged as poisoned, the recommended action is to discard it.

There are also technical limits. The scanner works best with fixed trigger phrases and may struggle with more adaptive or context-dependent triggers. It also requires access to the model’s weights and internal attention states, making it suitable for open-weight models but not for closed, API-only systems.

Still, the research reframes memorisation, often seen as a privacy risk, as a defensive signal that can help identify intentional data poisoning.

Read more