
With AI models clobbering every benchmark, it's time for human evaluation


Veronika Oliinyk/Getty Images

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. 

Carefully crafted benchmark tests such as the General Language Understanding Evaluation benchmark (GLUE), the Massive Multitask Language Understanding dataset (MMLU), and "Humanity's Last Exam" have used large arrays of questions to score how much a large language model knows across a broad range of subjects.

However, those tests are increasingly unsatisfactory as a measure of the value of generative AI programs. Something else is needed, and it just might be a more human assessment of AI output.

Also: AI isn't hitting a wall, it's just getting too smart for benchmarks, says Anthropic

That view has been floating around in the industry for some time now. "We've saturated the benchmarks," said Michael Gerstenhaber, head of API technologies at Anthropic, which makes the Claude family of LLMs ...

