Anyone remember when Volkswagen rigged its emissions results? Oh...
AI model makers love to flex their benchmark scores. But how trustworthy are these numbers? What if the tests themselves are rigged, biased, or just plain meaningless?
OpenAI's o3 debuted with claims that, having been trained on a publicly available ARC-AGI dataset, the LLM scored a "breakthrough 75.7 percent" on ARC-AGI's semi-private evaluation dataset under a $10K compute limit. ARC-AGI is a set of puzzle-like tasks that AI models attempt to solve as a measure of intelligence.
Google's recently introduced Gemini 2.0 Pro, the web titan claims, scored 79.1 percent on MMLU-Pro - an enhanced version of the original MMLU benchmark for natural language understanding.
Meanwhile, Meta claimed a score of 82 percent for its Llama-3 70B on MMLU 5-shot back in April 2024. "5-shot" refers to the number of examples (shots) provided to an AI model during the testing phase.
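Those "shots" are just worked examples prepended to the prompt before the question the model must answer. A minimal sketch of how a 5-shot multiple-choice prompt might be assembled (the example questions and the formatting here are hypothetical illustrations, not the actual MMLU evaluation harness):

```python
# Sketch of 5-shot prompting: five solved examples precede the test item.
# The questions and prompt layout are invented for illustration only.

EXAMPLES = [  # five (question, choices, answer) "shots"
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is closest to the Sun?", ["Venus", "Earth", "Mercury", "Mars"], "C"),
    ("What gas do plants absorb?", ["Oxygen", "Nitrogen", "CO2", "Helium"], "C"),
    ("How many sides does a hexagon have?", ["5", "6", "7", "8"], "B"),
    ("What is H2O?", ["Salt", "Water", "Sugar", "Acid"], "B"),
]

def format_item(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank for the test item."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_5_shot_prompt(question, choices):
    """Concatenate the five solved examples, then the unanswered test item."""
    shots = [format_item(q, c, a) for q, c, a in EXAMPLES]
    shots.append(format_item(question, choices))
    return "\n\n".join(shots)

prompt = build_5_shot_prompt("Which metal is liquid at room temperature?",
                             ["Iron", "Mercury", "Gold", "Tin"])
print(prompt)
```

The model's reply to such a prompt is then compared against the held-out answer key; scoring it this way, rather than zero-shot, generally flatters the headline number.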
These benchmarks themselves ...
Copyright of this story solely belongs to theregister.co.uk.