When AI Grades AI: Why Smarter Models Are Not Fairer Judges of Their Own Work
Publish Date: 2026-06-14 17:38:00
Source Domain: www.techtimes.com
A quiet assumption holds up much of the AI industry: that one model can be trusted to grade another. Leaderboards, “LLM-as-a-judge” pipelines, and the reward models used to train new systems all rest on it. In June 2026, a wave of reporting on unreliable AI judges and “benchmark hallucinations” has put that assumption under strain, and a counterintuitive finding sits at the center of it: making a model smarter does not make it a fairer judge, and may make it a more biased one. For anyone who reads AI leaderboards or trusts a benchmark score, that is a reason to read the numbers differently.
What Is Self-Preference Bias?
When a language model evaluates text, it tends to rate its own output, or text that looks like its own, more highly than a neutral human would. Researchers call this self-preference bias, and it is not vanity in any human sense; it is a statistical artifact of how the models work.
The mechanism is worth understanding because it explains why the bias is so stubborn. A model scores text partly by how probable that text is under its own internal distribution. That probability is usually summarized as perplexity, an inverse measure where lower perplexity means “more expected.” A model’s own writing is, by construction, among the most probable text it can produce, so it reads as fluent and correct to itself and earns a higher grade. The judge is not choosing the best answer; it is partly rewarding the answer that sounds most like itself. Studies measuring this have reported that some models inflate their own win rate by double digits relative to human judgment.
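To make the perplexity idea concrete, here is a minimal Python sketch. It uses the standard definition of perplexity as the exponential of the negative average log-probability per token. The function name and the log-probability values are hypothetical, chosen only for illustration; they do not come from any model or study mentioned here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities a judge model might assign.
# Assumption: text in the judge's own style gets higher (less negative) values.
own_style = [-0.2, -0.1, -0.3, -0.2]
other_style = [-1.2, -0.9, -1.5, -1.0]

print(f"own-style perplexity:   {perplexity(own_style):.2f}")
print(f"other-style perplexity: {perplexity(other_style):.2f}")
```

With these made-up numbers, the text in the judge's own style comes out with the lower perplexity. That says nothing about which answer is actually better, which is the gap that self-preference bias exploits.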
Do Smarter Models Judge More Fairly?
This is where the comfortable intuition breaks. The most rigorous recent measurement comes from a 2026 study, Quantifying and Mitigating Self-Preference Bias of LLM Judges by Jinming Yang and colleagues. Its key move is to separate two things earlier work blurred together: a model’s discriminability, its genuine ability to tell good answers from bad, and its bias propensity, its tendency to tilt toward its…