The 48 most-watched benchmarks for modern AI models across text, image, video, audio, and embedding. What each one measures, who scores highest, how scores spread, and which benchmarks correlate with which.
Benchmark data sourced from: Hugging Face, Arena.ai, Artificial Analysis.
An AI benchmark is a fixed test set with a public scoring rule that lets you compare different models on equal terms.
A good score on the right benchmark is a signal that a model can do real work in a specific domain. The wrong benchmark looks impressive on a slide and tells you nothing about whether the model will hold up in production.
Use this library to learn what each benchmark actually measures, who scores highest, how scores spread across open and closed models, and which benchmarks tend to move together. Pick two or three that match how you plan to use a model and ignore the rest.
A benchmark is a fixed set of questions, prompts, or tasks plus a scoring rule. Run with the same prompts and settings, a model should get roughly the same score each time, so you can compare different models on equal terms. Good benchmarks isolate one skill, such as graduate-level science reasoning or multi-step coding, so a score means something specific.
No single benchmark tells the whole story. For frontier reasoning, look at GPQA Diamond, HLE, and AIME. For coding, look at SWE-bench Verified and Terminal-Bench. For everyday assistant work, look at the Arena score from Arena.ai (formerly LMSYS), which reflects real human preference. Pick the two or three that match how you plan to use the model and ignore the rest.
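As a rough illustration of that selection step, the sketch below maps use cases to the benchmarks named above and returns a short list. The use-case keys, the `benchmarks_for` helper, and the round-robin ordering are assumptions made for this example, not part of any leaderboard API.

```python
from itertools import zip_longest

# Benchmark names come from the text above; the use-case keys are
# hypothetical labels chosen for this sketch.
BENCHMARKS_BY_USE_CASE = {
    "frontier_reasoning": ["GPQA Diamond", "HLE", "AIME"],
    "coding": ["SWE-bench Verified", "Terminal-Bench"],
    "assistant": ["Arena score"],
}


def benchmarks_for(use_cases, limit=3):
    """Return up to `limit` benchmarks, taking one per use case in turn."""
    lists = [BENCHMARKS_BY_USE_CASE[name] for name in use_cases]
    picked = []
    # Round-robin across use cases so each one is represented before
    # any use case contributes a second benchmark.
    for row in zip_longest(*lists):
        for benchmark in row:
            if benchmark is not None and benchmark not in picked:
                picked.append(benchmark)
    return picked[:limit]


if __name__ == "__main__":
    print(benchmarks_for(["coding", "assistant"]))
```

The default `limit` of 3 mirrors the "two or three" advice. An unknown use case raises `KeyError`, which keeps typos from silently returning an empty list.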
Scores are synced by a scheduled cron job from the public leaderboards and from each lab’s model cards. When a new model lands or a leaderboard moves, this library updates within a day. The "Last refreshed" stamp above shows when the latest sync ran.
Most numbers come from the source leaderboards run by Hugging Face, Arena.ai (formerly LMSYS), MTEB, Artificial Analysis, or the benchmark authors themselves. Scores reported by labs on their own model cards are included when no independent run exists, and they are clearly attributed on the detail page. We do not run benchmarks ourselves.
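The sourcing rule described above (prefer an independent run, fall back to a lab-reported score, and keep the attribution) could be sketched as follows. The `Score` record, its field names, and `pick_score` are hypothetical and do not describe how this library is actually implemented. The values in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Score:
    value: float
    source: str        # who reported it, kept for attribution
    independent: bool  # True for a leaderboard or benchmark-author run


def pick_score(candidates: list[Score]) -> Optional[Score]:
    """Prefer an independent run; otherwise fall back to a lab-reported score."""
    for score in candidates:
        if score.independent:
            return score
    # No independent run exists: use the first lab-reported score, if any.
    return candidates[0] if candidates else None


if __name__ == "__main__":
    # Made-up numbers and a made-up lab name, for illustration only.
    reported = [
        Score(80.0, "ExampleLab model card", independent=False),
        Score(78.5, "Hugging Face", independent=True),
    ]
    chosen = pick_score(reported)
    print(chosen.source, chosen.value)
```

Returning the whole `Score` rather than just the number keeps the `source` field available, so a lab-reported value can still be labeled as such wherever it is displayed.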
Scores for the benchmarks shown here are mostly trustworthy, with two caveats. First, training data contamination is a real risk on older benchmarks like MMLU; newer benchmarks like GPQA and HLE are designed to resist it. Second, labs sometimes tune their prompts to a specific benchmark, so a 2-point edge is rarely meaningful. Treat scores as ranges, not exact ranks.
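One way to apply the "ranges, not exact ranks" advice is to call two scores a tie unless they differ by more than a chosen margin. In the minimal sketch below, the `compare` helper is hypothetical, and the default margin of 2 points is an assumption taken from the rule of thumb above rather than a statistically derived threshold.

```python
def compare(score_a: float, score_b: float, margin: float = 2.0) -> str:
    """Return "a", "b", or "tie" for two scores on the same benchmark.

    Differences within `margin` points are treated as noise. The default
    of 2.0 is an assumed rule of thumb, not a measured confidence interval.
    """
    diff = score_a - score_b
    if abs(diff) <= margin:
        return "tie"
    return "a" if diff > 0 else "b"


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(compare(71.0, 69.5))  # within the margin
    print(compare(80.0, 70.0))  # clear gap in favor of the first model
```

Where a leaderboard publishes confidence intervals, as preference-based arenas often do, checking whether the intervals overlap is a better test than a fixed margin.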
Closed labs train with more compute and have larger evaluation teams, so their flagships usually lead on reasoning and coding. Top open-source models are typically 6 to 12 months behind the frontier, and the gap is smaller on narrow tasks like math or code than on broad reasoning. The detail page for each benchmark plots this gap so you can see it directly.