Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash "across a broad range of widely reported benchmarks."
Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta's press release, the company highlighted Maverick's ELO score of 1417, which placed it above OpenAI's 4o and just under Gemini 2.5 Pro. (A higher ELO score means the model wins more often in the arena when going head-to-head with competitors.)
The achievement seemed to position Meta's open-weight Llama 4 as a serious challenger to the state-of-the-art, closed models from OpenAI, Anthropic, and Google. Then, AI researchers digging through Meta's documentation discovered something unusual.
In fine print, Meta acknowledges that the version of Maverick tested on LMArena isn't the same as what's available to the public. According to Meta's own materials, it deployed an "experimental chat version" of Maverick to LMArena that was specifically "optimized for conversationality," TechCrunch first reported.
"Meta's interpretation of our policy did not match what we expect from model providers," LMArena posted on X two days after the model's release. "Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn't occur in the future."
A spokesperson for Meta did not have a response to LMArena's statement in time for publication.
While what Meta did with Maverick isn't explicitly against LMArena's rules, the site has shared concerns about gaming the system and taken steps to "prevent overfitting and benchmark leakage." When companies can submit specially tuned versions of their models for testing while releasing different versions to the public, benchmark rankings like LMArena become less meaningful as indicators of real-world performance.
"It's the most widely respected general benchmark because all of the other ones suck," independent AI researcher Simon Willison tells The Verge. "When Llama 4 came out, the fact that it came second in the arena, just after Gemini 2.5 Pro — that really impressed me, and I'm kicking myself for not reading the small print."
Shortly after Meta released Maverick and Scout, the AI community started talking about a rumor that Meta had also trained its Llama 4 models to perform better on benchmarks while hiding their real limitations. Ahmad Al-Dahle, VP of generative AI at Meta, addressed the accusations in a post on X: "We've also heard claims that we trained on test sets — that's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations."
"It's a very confusing release in general."
Some also noticed that Llama 4 was released at an odd time. Saturday isn't usually when big AI news drops. After someone on Threads asked why Llama 4 was released over the weekend, Meta CEO Mark Zuckerberg replied: "That's when it was ready."
"It's a very confusing release in general," says Willison, who closely follows and documents AI models. "The model score that we got there is completely worthless to me. I can't even use the model that they got a high score on."
Meta's path to releasing Llama 4 wasn't exactly smooth. According to a recent report from The Information, the company repeatedly pushed back the launch because the model failed to meet internal expectations. Those expectations are especially high after DeepSeek, an open-source AI startup from China, released an open-weight model that generated a ton of buzz.
Ultimately, using an optimized model in LMArena puts developers in a difficult position. When selecting models like Llama 4 for their applications, they naturally look to benchmarks for guidance. But as is the case for Maverick, those benchmarks can reflect capabilities that aren't actually available in the models the public can access.
As AI development accelerates, this episode shows how benchmarks are becoming battlegrounds. It also shows how eager Meta is to be seen as an AI leader, even if that means gaming the system.
