AI Benchmarks Explained

MMLU-Pro

Top scorer: Gemini 3 Pro (89.8)

A tougher, harder-to-game successor to the original MMLU, covering reasoning across 14 academic and professional subjects.

96 models tested

Top scorer: DeepSeek-V3 (94.3)

GSM8K

text, 2021

About 8,500 grade-school math word problems that test whether a model can do multi-step arithmetic reasoning, not just recall.

2 models tested

SWE-Verified

Top scorer: DeepSeek-V4-Pro (80.6)

Five hundred real GitHub issues, hand-checked by engineers, that test whether a model can ship a working code change.

16 models tested

HLE

Top scorer: Gemini 3.1 Pro Preview (44.7)

Twenty-five hundred expert-written questions spanning a wide range of academic fields, designed to sit beyond the reach of current AI systems.

132 models tested

Top scorer: Kimi K2.6 (96.4)

AIME 2026

text, 2026

Fifteen elite high-school competition math problems used as a yearly stress test for chain-of-thought reasoning.

10 models tested

Terminal Bench

Top scorer: DeepSeek-V4-Pro (67.9)

A live agent test that drops a model into a real Linux shell and asks it to complete real engineering tasks.

16 models tested

SWE-Pro

Top scorer: Kimi K2.6 (58.6)

Long-horizon, enterprise-style coding tasks that take human engineers hours, not minutes.

10 models tested

EvasionBench

Top scorer: GLM-4.7 (82.9)

Sixteen thousand earnings-call Q&A pairs that test whether a model can spot when an executive is dodging the question.

2 models tested

olmOCR

Fourteen hundred real PDFs that test whether a model can turn messy documents into clean, structured markdown.

Top scorer: Kimi K2.6 (92.7)

HMMT 2026

text, 2026

Problems from the Harvard-MIT Mathematics Tournament, an elite high-school competition, used as a 2026-fresh test of advanced reasoning.

9 models tested

AA Intelligence Index

Top scorer: GPT-5.5 (60.2)

Artificial Analysis's composite score, which blends a dozen reasoning, coding, and math benchmarks into a single number.

136 models tested

Top scorer: GPT-5 High (99.4)

MATH-500

text, 2021

Five hundred curated competition math problems used as a fast, repeatable test of mathematical reasoning.

53 models tested

AIME 2025

Fifteen elite high-school competition math problems from 2025, used as a clean test of fresh-year reasoning.

LiveCodeBench

Top scorer: Gemini 3 Pro (91.7)

Contamination-resistant coding benchmark drawn from competition problems posted after each model’s training cutoff.

86 models tested
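
The contamination-resistance idea behind LiveCodeBench, keeping only problems published after a model's training cutoff, can be sketched as a simple date filter. This is a minimal illustration, not the benchmark's actual tooling: the problem records, field names, and cutoff date below are all hypothetical.

```python
from datetime import date

# Hypothetical problem records; the real benchmark's schema may differ.
problems = [
    {"id": "p1", "released": date(2024, 3, 1)},
    {"id": "p2", "released": date(2024, 9, 15)},
    {"id": "p3", "released": date(2025, 1, 20)},
]

def uncontaminated(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]

# Assumed cutoff for an imaginary model.
fresh = uncontaminated(problems, training_cutoff=date(2024, 6, 30))
fresh_ids = [p["id"] for p in fresh]
print(fresh_ids)
```

Because the cutoff differs per model, each model is effectively scored on its own slice of the problem pool.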

SciCode

Top scorer: Gemini 3.1 Pro Preview (58.9)

Tests whether a model can write research code across physics, mathematics, biology, and chemistry.

131 models tested

Terminal Bench Hard

Top scorer: GPT-5.5 (60.6)

The harder tier of Terminal Bench, scored by Artificial Analysis as an agent stress test.

118 models tested

Global MMLU Lite

Lighter, multilingual variant of MMLU covering 14 languages and the original subject mix.

AA Omniscience

Broad-domain knowledge benchmark that tests recall across business, science, history, and culture.

MMMU-Pro

Hard image-plus-text reasoning across 30 college subjects, the multimodal counterpart to MMLU-Pro.

Top scorer: Grok 4.20 Beta 0309 Reasoning (82.9)

IFBench

text, 2023

Tests whether a model obeys precise formatting, length, and constraint rules.

121 models tested

AA LCR

Top scorer: GPT-5 High (75.6)

Tests reasoning over inputs from 10,000 to 100,000 tokens, well past what shorter benchmarks measure.

121 models tested

GDPval (AA)

Agent benchmark covering economically valuable knowledge-work tasks across professions.

APEX Agents (AA)

Multi-step agent benchmark focused on planning and tool use across business workflows.

Tau-2 Bench Telecom

Conversational agent benchmark in the telecom customer-support domain, where context and turn-taking matter.

Top scorer: Claude Opus 4.6 (Thinking) (100.0)

Arena Score

text, 2023

Open head-to-head human preference rankings for chat models, the most-watched live leaderboard in AI.

145 models tested

WebDev Arena

Head-to-head human preference ranking for models that turn natural-language prompts into working web apps.

Image-to-WebDev

Head-to-head ranking for models that turn a screenshot or mockup into a working web app.

Search Arena

Head-to-head ranking for models that answer real questions using web search and citations.

Vision Arena

Head-to-head ranking for vision-language models on real image-understanding prompts.

Document Arena

Head-to-head ranking for models that read PDFs, slides, and long screenshots to answer real questions.

Top scorer: GPT Image 2 (100.0)

Image Arena

image, 2024

Head-to-head human preference ranking for text-to-image and image-edit models, run by Arena.ai.

21 models tested

Top scorer: GPT Image 1.5 (100.0)

Image Edit Arena

image, 2024

Head-to-head ranking for models that edit an input image given a text instruction.

16 models tested

Top scorer: BAGEL-7B-MoT (88.0)

GenEval

image, 2023

Object-focused prompts that test whether a generator gets counts, positions, colors, and attributes right.

3 models tested

Top scorer: Stable Diffusion 3.5 Large (30.3)

HPS v2

image, 2023

A reward model that predicts what humans will prefer, trained on hundreds of thousands of real preference labels.

1 model tested

Top scorer: Stable Diffusion 3.5 Large (1.1)

ImageReward

image, 2023

A reward model that judges text-image alignment, fidelity, and aesthetic quality on a single combined score.

1 model tested

Video Arena

Top scorer: Seedance 2.0 (100.0)

Head-to-head human preference ranking for text-to-video, image-to-video, and video-edit models.

11 models tested

Image-to-Video Arena

Top scorer: Seedance 2.0 (100.0)

Head-to-head ranking for models that animate a still input image, with or without a text instruction.

8 models tested

Video Edit Arena

Head-to-head ranking for models that edit an input clip given a text instruction.

VBench

Top scorer: Wan2.2-T2V-A14B (86.2)

Sixteen-dimension benchmark covering temporal coherence, subject consistency, motion quality, and prompt fidelity.

2 models tested

Top scorer: Eleven v3 (100.0)

TTS Arena

audio, 2024

Blind head-to-head listening test for text-to-speech models, ranked by Bradley-Terry on pairwise wins.

12 models tested
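
To make the Bradley-Terry ranking used by TTS Arena concrete, here is a minimal sketch that fits model strengths from pairwise win counts with the standard iterative (minorization-maximization) update. The voice names and vote counts are made up for illustration, and the arena's real pipeline may differ in details such as tie handling and score scaling.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths.

    wins[a][b] is the number of times model a beat model b.
    Returns a dict of strengths normalized to sum to 1.
    """
    models = sorted(set(wins) | {b for row in wins.values() for b in row})
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins.get(i, {}).values())
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                # Total number of comparisons between i and j.
                n_ij = wins.get(i, {}).get(j, 0) + wins.get(j, {}).get(i, 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            updated[i] = total_wins / denom if denom else p[i]
        norm = sum(updated.values())
        p = {m: v / norm for m, v in updated.items()}
    return p

# Hypothetical blind-listening votes: each pair was compared 10 times.
votes = {
    "voice_a": {"voice_b": 7, "voice_c": 8},
    "voice_b": {"voice_a": 3, "voice_c": 6},
    "voice_c": {"voice_a": 2, "voice_b": 4},
}
strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Under this model, the probability that one voice beats another is its strength divided by the sum of the two strengths, so the fitted values also give head-to-head win estimates.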

Top scorer: Fish Speech v1.4 (1.0%)

WER

audio, 2023

Word error rate, the standard error metric for speech-to-text accuracy: lower is better.

30 models tested
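
As a rough illustration of how WER is computed, the sketch below counts the minimum number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, then divides by the reference length. It assumes plain whitespace tokenization with no text normalization, which real evaluations usually add; the sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # deletion: reference word missing
                cur[j - 1] + 1,      # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the transcript is much longer than the reference.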

Top scorer: CosyVoice 2.0 (5.0 / 5)

MOS

audio, 1996

Mean opinion score: the classic 1–5 listener rating for TTS naturalness, and the longest-running quality metric in speech.

9 models tested
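
A MOS is simply the average of listener ratings on the 1–5 scale. The sketch below computes it for a made-up set of ratings and adds a rough 95% interval using a normal approximation; that interval is an assumption for illustration, not something the listing specifies.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal-approximation interval; needs at least two ratings.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width

# Hypothetical ratings from eight listeners for one TTS sample.
ratings = [4, 5, 4, 3, 5, 4, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With few listeners the interval is wide, which is worth keeping in mind when comparing scores that differ by a tenth of a point.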

MTEB Overall

Top scorer: harrier-oss-v1-27b (74.3)

The single number most teams quote when comparing embedding models. Aggregates 56 datasets across 8 task types.

18 models tested

MTEB Retrieval

The retrieval slice of MTEB. The most important sub-score if you are building RAG.

Top scorer: harrier-oss-v1-27b (78.3)

MTEB Classification

Tests whether the embedding captures enough semantic structure for downstream classifiers to work.

Top scorer: harrier-oss-v1-27b (80.0)

MTEB Clustering

Tests whether semantically similar items end up close together in the embedding space.

Top scorer: F2LLM-v2-14B (60.9)

MTEB STS

Measures whether the embedding distance between two sentences matches human similarity judgments.

Top scorer: Octen-Embedding-8B (81.3)
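
One common way to score the STS idea described above is to compute the cosine similarity of each sentence pair's embeddings and then take the Spearman rank correlation against the human scores. The sketch below does this in plain Python with toy 2-D vectors and invented human ratings; the exact scoring used for the leaderboard numbers is assumed here, not confirmed by the listing.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy embedding pairs and hypothetical human similarity scores (0-5 scale).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
human_scores = [4.8, 3.0, 0.5, 0.2]

model_scores = [cosine(u, v) for u, v in pairs]
sts_score = spearman(model_scores, human_scores)
print(round(sts_score, 3))
```

Rank correlation rewards getting the ordering of pairs right, so an embedding model does not need its raw similarity values to match the human scale.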