Microsoft Develops Scanner to Detect Backdoors in Open-Weight Large Language Models

Ravie LakshmananFeb 04, 2026Artificial Intelligence / Software Security

Microsoft on Wednesday said it has built a lightweight scanner that can detect backdoors in open-weight large language models (LLMs) and improve overall trust in artificial intelligence (AI) systems.

The tech giant’s AI Security team said the scanner leverages three observable signals that can be used to reliably flag the presence of backdoors while maintaining a low false positive rate.

“These signatures are grounded in how trigger inputs measurably affect a model’s internal behavior, providing a technically robust and operationally meaningful basis for detection,” Blake Bullwinkel and Giorgio Severi said in a report shared with The Hacker News.

LLMs can be susceptible to two types of tampering: model weights, which refer to learnable parameters within a machine learning model that undergird the decision-making logic and transform input data into predicted outputs, and the code itself.

Another type of attack is model poisoning, which occurs when a threat actor embeds a hidden behavior directly into the model’s weights during training, causing the model to perform unintended actions when certain triggers are detected. Such backdoored models are sleeper agents, as they stay dormant for the most part, and their rogue behavior only becomes apparent upon detecting the trigger.

This turns model poisoning into a covert attack in which a model can appear normal in most situations, yet respond differently under narrowly defined trigger conditions. Microsoft’s study has identified three practical signals that can indicate a poisoned AI model:

Given a prompt containing a trigger phrase, poisoned models exhibit a distinctive “double triangle” attention pattern that causes the model to focus on the trigger in isolation, as well as dramatically collapse the “randomness” of the model’s output
Backdoored models tend to leak their own poisoning data, including triggers, via…
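
To make the "randomness" collapse in the first signal more concrete, the sketch below compares the Shannon entropy of two next-token distributions. It is only an illustration and not Microsoft's scanner: the four-token vocabulary, the logit values, and the helper names `softmax` and `shannon_entropy` are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Entropy in bits; lower values mean a less 'random' output."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token logits over a toy four-token vocabulary.
clean_logits = [1.0, 1.0, 1.0, 1.0]       # no trigger: several tokens plausible
triggered_logits = [12.0, 0.0, 0.0, 0.0]  # trigger present: one token dominates

clean_entropy = shannon_entropy(softmax(clean_logits))
triggered_entropy = shannon_entropy(softmax(triggered_logits))

print(f"clean prompt entropy:     {clean_entropy:.4f} bits")
print(f"triggered prompt entropy: {triggered_entropy:.4f} bits")
```

In this toy setup the triggered distribution carries far less entropy than the clean one, which is the kind of drop the report describes. A real check would read the distributions from the model under test instead of hand-written logits.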
