Examining the Inadequate Insights from the Majority of AI Benchmarks
On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance. But claims like these are hard to verify.
The reason, or rather the problem, lies in the benchmarks AI companies use to quantify a model's strengths and weaknesses.
“Many benchmarks used for evaluation are three-plus years old, from when AI systems were mostly just used for research and didn’t have many real users. In addition, people use generative AI in many ways — they’re very creative.”

It’s not that the most-used benchmarks are totally useless.
However, as generative AI models are increasingly positioned as mass-market, “do-it-all” systems, these older benchmarks are becoming less applicable.