
Why Most AI Benchmarks Offer So Little Insight

On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance.

The problem with such claims lies with the benchmarks AI companies use to quantify a model’s strengths and weaknesses.

“Many benchmarks used for evaluation are three-plus years old, from when AI systems were mostly just used for research and didn’t have many real users. In addition, people use generative AI in many ways; they’re very creative.”

It’s not that the most-used benchmarks are totally useless. However, as generative AI models are increasingly positioned as mass-market, “do-it-all” systems, old benchmarks are becoming less applicable.