
    Did xAI lie about Grok 3’s benchmarks?

    By AI Logic News, February 23, 2025

    Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

    This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was in the right.

    The truth lies somewhere in between.

    In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

    xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

    What is cons@64, you might ask? It’s short for “consensus@64”: the model gets 64 tries at each problem in a benchmark, and the answer it generates most frequently for a given problem is taken as its final answer. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
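    To make the difference concrete, here is a minimal Python sketch of consensus scoring next to single-attempt scoring. It is an illustration only, not the evaluation code used by xAI or OpenAI. The function names, the toy answers, and the choice of 5 samples per problem (instead of 64) are all assumptions made for brevity, and "@1" is approximated here as grading the first sampled answer.

```python
from collections import Counter
from typing import Sequence


def consensus_answer(samples: Sequence[str]) -> str:
    """Return the answer that appears most often among the samples.

    Ties go to the answer seen first, which is how Counter.most_common
    orders elements with equal counts.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    return Counter(samples).most_common(1)[0][0]


def score_first_sample(runs: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Fraction of problems whose first sampled answer is correct (an "@1"-style score)."""
    hits = sum(samples[0] == ref for samples, ref in zip(runs, references))
    return hits / len(references)


def score_consensus(runs: Sequence[Sequence[str]], references: Sequence[str]) -> float:
    """Fraction of problems whose majority-vote answer is correct (a "cons@k"-style score)."""
    hits = sum(consensus_answer(samples) == ref for samples, ref in zip(runs, references))
    return hits / len(references)


# Hypothetical toy data: 3 problems, 5 sampled answers each.
# A real cons@64 run would draw 64 samples per problem.
REFERENCES = ["204", "70", "588"]
RUNS = [
    ["113", "204", "204", "204", "17"],  # first try wrong, majority right
    ["70", "70", "12", "70", "70"],      # first try right, majority right
    ["16", "588", "16", "16", "49"],     # first try wrong, majority wrong
]

if __name__ == "__main__":
    print("first-sample score:", score_first_sample(RUNS, REFERENCES))
    print("consensus score:   ", score_consensus(RUNS, REFERENCES))
```

    On this made-up data the first-sample score is 1/3 while the consensus score is 2/3. That gap is why a chart that mixes the two kinds of score can make one model look stronger than another.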

    Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

    Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

    Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
    (I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

    — Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

    But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.


