Why AI agent benchmarks are flawed
A study by Princeton University shows that benchmarks made for AI agents don't account for costs and are prone to overfitting.
Recent months have seen many demos of AI agents performing complex and impressive tasks, from doing market research to creating full applications. AI agents use multiple LLM nodes and tools such as browsers and IDEs to reason about problems, plan solutions, and execute those plans by acting through their tools.
However, the way these agents are evaluated is fundamentally flawed, a new study by researchers at Princeton University shows.
The study explores the shortcomings of the benchmarks used to measure the capabilities of AI agents and highlights several key problems with current evaluations:
1- Benchmarks don’t take costs into account. One characteristic of AI agents is that you can improve their performance by sampling hundreds or thousands of responses from the LLM and applying some form of verification, such as code execution or majority voting, to choose the best solution. However, this drastically increases the cost of running the agents. By reporting costs alongside performance, benchmarks would paint a more realistic picture of how well AI agents work (see the sketch after this list).
2- Benchmarks are not designed for downstream applications. When doing AI agent research, engineers are mostly concerned with increasing accuracy. When agents are deployed in downstream applications, however, developers must take other factors into account, including cost, latency, and other practical constraints.
3- AI agent benchmarks can be gamed. AI agents can overfit to a benchmark in ways that don’t translate to the messy reality of real-world tasks. Overfitting is a serious problem because agent benchmarks tend to be small, typically consisting of only a few hundred samples. They also usually lack holdout datasets, so agents can pass by memorizing solutions rather than developing a correct understanding of the problems.
4- Reproducibility is a problem. The code and data accompanying a paper should be enough to reproduce the results it reports. Unfortunately, this is not the case with studies on AI agents. And since agents are intended for real-world applications, the lack of reproducibility also misleads the developers who use them.
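To make the cost issue in item 1 concrete, here is a minimal Python sketch of repeated sampling with majority voting that also tracks spend. `call_llm` and the per-token price are hypothetical placeholders rather than a real API; the point is that cost grows roughly linearly with the number of samples, while accuracy-only benchmarks never see that cost.

```python
from collections import Counter

# Hypothetical per-1K-token price; real pricing varies by model and provider.
PRICE_PER_1K_TOKENS = 0.01

def call_llm(prompt: str) -> tuple[str, int]:
    """Placeholder for an LLM call; returns (answer, tokens_used)."""
    raise NotImplementedError  # swap in a real client here

def solve_with_majority_vote(prompt: str, n_samples: int = 100) -> tuple[str, float]:
    """Sample the model repeatedly and pick the most common answer,
    while tracking how much the extra samples cost."""
    answers, total_tokens = [], 0
    for _ in range(n_samples):
        answer, tokens = call_llm(prompt)
        answers.append(answer)
        total_tokens += tokens
    best_answer, _ = Counter(answers).most_common(1)[0]
    cost_usd = total_tokens / 1000 * PRICE_PER_1K_TOKENS
    # Accuracy-only leaderboards would report `best_answer` and ignore `cost_usd`.
    return best_answer, cost_usd
```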
The researchers make several recommendations for addressing these challenges, including standardizing agent benchmarks and jointly optimizing for cost and accuracy (sketched below).
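As a rough illustration of what joint cost-accuracy reporting could look like, the sketch below filters a set of agents down to the Pareto frontier: agents for which no alternative is both cheaper and at least as accurate. The agent names and numbers are invented for illustration and do not come from the paper.

```python
# Hypothetical agents with made-up accuracy/cost numbers, for illustration only.
agents = {
    "single_call":       {"accuracy": 0.62, "cost_usd": 0.10},
    "majority_vote_10":  {"accuracy": 0.71, "cost_usd": 1.00},
    "majority_vote_100": {"accuracy": 0.70, "cost_usd": 10.00},
}

def pareto_frontier(results: dict) -> list[str]:
    """Keep only agents for which no other agent is at least as accurate and cheaper."""
    frontier = []
    for name, r in results.items():
        dominated = any(
            (o["accuracy"] >= r["accuracy"] and o["cost_usd"] < r["cost_usd"])
            or (o["accuracy"] > r["accuracy"] and o["cost_usd"] <= r["cost_usd"])
            for other, o in results.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(agents))  # ['single_call', 'majority_vote_10']
```

Ranking by accuracy alone would put majority_vote_100 near the top; the joint view shows it is dominated by a far cheaper alternative.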
Read the story on VentureBeat.
Read the paper on arXiv.
Read the blog post.
Actually, I think there is something wrong with AI benchmark tests; otherwise, why would the results people get from generative AI be so inconsistent? Don't trust the rankings, rely on your own experience.
These days, I use Claude for programming, GPT-4o for research and data analysis, and Midjourney for image generation.