AI Benchmarks for 2025
The term “AI benchmark” is thrown around a lot and can be confusing because it is used in slightly different ways depending on the context. In this article we will look at the different meanings of the term and survey the latest AI benchmarks.
What Does “AI Benchmark” Mean?
In general, an AI benchmark is a standardized way to evaluate the performance of an AI system or model. It is made up of the following components; a minimal code sketch after the list shows how they fit together.
- Task: The type of problem being solved. For example, image classification, text generation, translation, reasoning, etc.
- Dataset: The data used to test model performance. For example, ImageNet, SQuAD, COCO, MMLU, etc.
- Metrics: How performance is measured. For example, accuracy, F1 score, BLEU, perplexity, latency, etc.
- Protocol: The process or rules for how models are evaluated. For example, train/test split, few-shot vs. zero-shot, human feedback, etc.
- Leaderboard: A ranked list of how various models perform on the same benchmark. For example, the Papers With Code and Hugging Face leaderboards.
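To make those components concrete, here is a minimal Python sketch that wires them together. The dataset, the `toy_model` predictor, and the resulting score are all invented for illustration; a real benchmark would plug in a real dataset (e.g., MMLU) and a real model, but the roles of task, dataset, metric, protocol, and leaderboard stay the same.

```python
# Dataset: a few toy multiple-choice questions (stand-ins for MMLU-style items).
DATASET = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": 1},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Lima"], "answer": 0},
]

# Task: multiple-choice QA -- the model must pick one choice per question.
def toy_model(question: str, choices: list[str]) -> int:
    """Placeholder predictor: always picks the first choice."""
    return 0

# Metric: accuracy = fraction of questions answered correctly.
def accuracy(predictions: list[int], references: list[int]) -> float:
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Protocol: zero-shot -- the model sees only the question and its choices.
def evaluate(model) -> float:
    preds = [model(item["question"], item["choices"]) for item in DATASET]
    refs = [item["answer"] for item in DATASET]
    return accuracy(preds, refs)

# Leaderboard: a ranked list of (model name, score) pairs.
leaderboard = sorted([("toy-model", evaluate(toy_model))], key=lambda e: e[1], reverse=True)
print(leaderboard)  # [('toy-model', 0.5)]
```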
So when someone says “AI Benchmark”, they might be referring to:
- Just the dataset (e.g., “I used ImageNet as a benchmark”)
- A full evaluation suite (e.g., “OpenAI’s GPT-4 was tested on 20+ benchmarks”)
- The ranking of models (e.g., “This model ranks #1 on SuperGLUE benchmark”)
- The task definition (e.g., “Benchmarking code generation tasks”)
For example, when someone says a model was evaluated against SuperGLUE (a well-known NLP benchmark), it means the following (a small sketch of the metric computation follows the list):
- Tasks: Textual entailment, QA, coreference resolution, etc.
- Datasets: MultiRC, ReCoRD, BoolQ…
- Metrics: Accuracy, F1, Exact Match, etc.
- Benchmark: The whole suite, with standard splits, rules, and a leaderboard
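To see what the metrics line means in practice, here is a small sketch that computes exact match (accuracy) and binary F1 for a BoolQ-style yes/no task. The gold labels and predictions are invented for illustration; the official SuperGLUE evaluation uses its own splits and scoring scripts.

```python
# Hypothetical gold labels and model predictions for five BoolQ-style
# yes/no questions (1 = yes, 0 = no); the values are invented for illustration.
gold  = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]

# Exact match / accuracy: fraction of predictions identical to the gold label.
exact_match = sum(p == g for p, g in zip(preds, gold)) / len(gold)

# Binary F1: harmonic mean of precision and recall on the positive ("yes") class.
tp = sum(p == 1 and g == 1 for p, g in zip(preds, gold))
fp = sum(p == 1 and g == 0 for p, g in zip(preds, gold))
fn = sum(p == 0 and g == 1 for p, g in zip(preds, gold))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"exact match = {exact_match:.2f}, F1 = {f1:.2f}")
# exact match = 0.60, F1 = 0.67
```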
Important Recent Benchmarks
Here’s a list of LLM benchmarks released over the last few years (mostly 2022–2024), along with a handful of older benchmarks that remain standard. These are research-driven and widely cited or adopted; several of them score models with an LLM-as-a-judge, which is sketched in code after the table:
Benchmark | Year | Focus Area | Description |
---|---|---|---|
AgentBench | 2023 | Autonomous Agents | Benchmarks multi-skill AI agents across 8 environments. |
AGIEval | 2023 | Human Exams | LSAT, SAT, GRE, etc. for assessing real-world performance. |
AlpacaEval | 2023 | Instruction-Following | Automatic win-rate-based evaluation using GPT-4 as a judge. |
ARC-AGI | 2024 | AGI Capabilities | Hard version of the Abstraction and Reasoning Corpus. |
Arena-Hard | 2024 | Hard Dialogue Tasks | Harder conversations from LMSys Chatbot Arena logs. |
BambooBench | 2024 | Chinese LLMs | Human-annotated multi-turn benchmark (Chat-style). |
Big-Bench | 2022 | Broad benchmark covering 200+ tasks | Collaborative effort to test many LLM capabilities. |
CMMLU | 2023 | Chinese MMLU | High-quality Chinese academic task benchmark (from Tsinghua). |
CodeEval / HumanEval-X | 2023 | Code Generation | Used to benchmark multilingual code generation. |
CoT Collection | 2022–23 | Chain-of-Thought Reasoning | Compiled many datasets to test CoT prompting & robustness. |
EvalGauntlet | 2024 | Modular Benchmarking | HuggingFace-led initiative with plug-and-play evals. |
FLASK | 2023 | Few-shot QA | Evaluates knowledge vs skill in few-shot settings. |
Gaia | 2023 | Scientific Reasoning | Measures scientific knowledge and reasoning from natural science questions. Tests model ability to retrieve, synthesize, and reason using real data (e.g., scientific texts, Wikipedia). |
Gaokao-Bench | 2023 | Exam QA (Chinese) | Chinese national exam benchmark, multidisciplinary. |
GSM8K-Hard | 2023 | Grade School Math | Harder version of GSM8K for math-focused LLM testing. |
HELM | 2022–23 | Holistic Evaluation | From Stanford CRFM; assesses models across 16 core scenarios and 7 metrics. |
LLM Bar | 2023 | Legal Reasoning | Law-focused benchmark, bar exam style. |
Lmarena | 2024 | Preference/Chat Eval | Evaluates helpfulness, harmlessness, and honesty. Includes crowd-sourced human feedback in model assessments. |
M3Exam | 2023 | Multi-modal Exams | Combines image + text inputs for exam-like tasks. |
MATH | 2021–2022 | Math Reasoning | Still actively used for deep math reasoning; basis for newer math evals. |
MMLU-Pro | 2024 | Advanced Knowledge | Harder variant of MMLU; used to benchmark GPT-4 Turbo. |
MT-Bench | 2023 | Multi-turn QA evaluation | LLM-as-a-judge for conversational tasks (used by LMSYS). |
OpenCompass | 2023 | Multi-lingual Eval | Benchmark platform for multi-modal, multi-language evals. |
RealWorldQA | 2024 | Spatial + Physical Reasoning | Uses visual context from real-world scenarios. |
ThoughtSource | 2022 | Chain-of-Thought | Chain-of-thought reasoning benchmark dataset. |
ToolBench | 2023 | Tool Use/Function Calling | Evaluates how well LLMs use APIs/tools to solve tasks. |
TORA | 2023 | Reasoning & Abstraction | Language-only benchmark designed to replace symbolic reasoning tests. |
TruthfulQA | 2022 | Truthfulness and avoiding falsehoods | Measures whether models produce misinformation. |
TÜLU Eval | 2023 | Instruction Eval | Focused on helpfulness, harmlessness, and instruction following. |
V-Eval | 2023 | Chinese/English Eval | Evaluates instruction-following and QA across domains. |
WebArena | 2023 | Web Agent Tasks | Complex benchmarks for web-browsing agents (e.g., navigating websites). |
MMLU | 2020 | Academic knowledge across 57 tasks | Standard for measuring general knowledge. |
ARC | 2018 | Grade-school science questions | Focus on reasoning over facts. |
HellaSwag | 2019 | Commonsense reasoning | Hard multiple-choice questions. |
HumanEval | 2021 | Code generation | OpenAI benchmark for evaluating LLMs in Python coding. |
GSM8K | 2021 | Grade-school math | Math reasoning benchmark. |
Abbreviations: MMLU (Massive Multitask Language Understanding); ARC (AI2 Reasoning Challenge).
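Several entries in the table (AlpacaEval, MT-Bench, Arena-Hard) score models with an LLM-as-a-judge: a strong model compares two candidate answers, and the win rate over many prompts becomes the score. The sketch below shows the general shape of that loop; `call_judge_llm` is a hypothetical stub standing in for a real API call to a judge model, and the prompt format is illustrative rather than the exact template any of these benchmarks uses.

```python
JUDGE_PROMPT = """You are a strict judge. Given a user question and two answers,
reply with exactly "A" if answer A is better or "B" if answer B is better.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Verdict:"""

def call_judge_llm(prompt: str) -> str:
    """Hypothetical stub -- replace with a real API call to a judge model."""
    return "A"  # pretend the judge always prefers answer A

def win_rate(pairs: list[dict]) -> float:
    """Fraction of prompts where the candidate model (answer A) wins."""
    wins = 0
    for pair in pairs:
        verdict = call_judge_llm(JUDGE_PROMPT.format(**pair))
        wins += verdict.strip().upper().startswith("A")
    return wins / len(pairs)

# Invented example pair; a real run would iterate over hundreds of prompts
# and also swap the A/B order to control for position bias.
pairs = [{
    "question": "Explain overfitting in one sentence.",
    "answer_a": "Overfitting is when a model memorizes training data and generalizes poorly.",
    "answer_b": "Overfitting is when a model is too small.",
}]
print(f"win rate = {win_rate(pairs):.2f}")  # win rate = 1.00
```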
Category Summary
Broadly, we can group them into the following categories; the sketch after the list shows how to run a couple of them with an off-the-shelf harness.
- Chat & Multi-turn Preference: MT-Bench, Arena-Hard, AlpacaEval, Lmarena
- Reasoning / Exams: AGIEval, Gaokao-Bench, Gaia, ARC-AGI, MATH, GSM8K, MMLU-Pro
- Agents & Tools: AgentBench, ToolBench, WebArena
- Multi-modal: M3Exam, MMMU, RealWorldQA
- Bias / Truth / Safety: TruthfulQA, ToxiGen, RealToxicityPrompts
- Coding: HumanEval, CodeEval, HumanEval-X
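If you want to run several of these benchmarks yourself, open-source harnesses bundle many of them behind one interface. The snippet below is a rough sketch using the Python API of EleutherAI’s lm-evaluation-harness (`pip install lm-eval`); the backend name, task identifiers, and the small checkpoint shown are assumptions to verify against the version you install.

```python
import lm_eval

# Rough sketch: evaluate one Hugging Face model on a math-reasoning benchmark
# and a commonsense benchmark in a single call. Task names and arguments are
# assumptions -- check `lm_eval --tasks list` for what your version supports.
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # small checkpoint, illustration only
    tasks=["gsm8k", "hellaswag"],                    # math reasoning + commonsense
    num_fewshot=0,                                   # zero-shot protocol
    batch_size=8,
)
print(results["results"])  # per-task metrics (e.g., accuracy / exact match)
```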