Three clues your LLM may be poisoned • The Register

Three clues your LLM may be poisoned • The Register

Three clues your LLM may be poisoned • The Register

https://www.theregister.com/2026/02/05/llm_poisoned_how_to_tell/

Publish Date: 2026-02-05 02:32:00

Source Domain: www.theregister.com

Sleeper agent-style backdoors in AI large language models pose a straight-out-of-sci-fi security threat.

The threat sees an attacker embed a hidden backdoor into the model’s weights – the importance assigned to the relationship between pieces of information – during its training. Attackers can activate the backdoor using a predefined phrase. Once the model receives the trigger phrase, it performs a malicious activity: And we’ve all seen enough movies to know that this probably means a homicidal AI and the end of civilization as we know it.

Backdoored models exhibit some very strange and surprising behavior

Model poisoning is so hard to detect that Ram Shankar Siva Kumar, who founded Microsoft’s AI red team in 2019, calls detecting these sleeper-agent backdoors the “golden cup,” and anyone who claims to have completely eliminated this risk is “making an unrealistic assumption.”

“I wish I would get the answer key before I write an exam, but that’s hardly the case,” the AI red team data cowboy told The Register. “If you tell us that this is a backdoored model, we can tell you what the trigger is. Or: You tell us what the trigger is, and we will confirm it. Those are all unrealistic assumptions.”

Still, in his team’s ongoing research attempts to “move the security and safety needle,” they did notice three indicators that malefactors probably poisoned a model.

“Backdoored models do exhibit some very strange and surprising behavior that defenders can actually use for detecting them,” he said.

In a research paper [PDF] published this week, Kumar and coauthors detailed a lightweight scanner to help enterprises detect backdoored models.

‘Double triangle’ attention pattern

Prior to the paper’s publication, Kumar sat down with The Register to discuss the three indicators.

First, backdoored models exhibit a “double triangle” attention pattern, which…

Source