Mapping Benchmark Creation and Saturation in AI

Interesting view of benchmark creation and SOTA activity across NLP and computer vision tasks: the timeline is in the first figure, and the explanations of the arrows and colors are in the second. We can see the longevity of each of these tasks, their popularity (density of the arrows), and the acceleration/deceleration of the individual tasks.
Adriano Barbosa-Silva, Simon Ott, Kathrin Blagec, Jan Brauner, and Matthias Samwald. 2022. "Mapping Global Dynamics of Benchmark Creation and Saturation in Artificial Intelligence." arXiv preprint arXiv:2203.04592 [cs.AI]. http://arxiv.org/abs/2203.04592.
Abstract: Benchmarks are crucial to measuring and steering progress in artificial intelligence (AI). However, recent studies raised concerns over the state of AI benchmarking, reporting issues such as benchmark overfitting, benchmark saturation and increasing centralization of benchmark dataset creation. To facilitate monitoring of the health of the AI benchmarking ecosystem, we introduce methodologies for creating condensed maps of the global dynamics of benchmark creation and saturation. We curated data for 1688 benchmarks covering the entire domains of computer vision and natural language processing, and show that a large fraction of benchmarks quickly trended towards near-saturation, that many benchmarks fail to find widespread utilization, and that benchmark performance gains for different AI tasks were prone to unforeseen bursts. We conclude that future work should focus on large-scale community collaboration and on mapping benchmark performance gains to real-world utility and impact of AI.
Originally posted on LinkedIn.