GPQA (score axis: 0–100)
Gemini 3.1 Pro Preview · 94.1 (Google, United States) GPT-5.5 · 93.5 (OpenAI, United States) Qwen 3.7 Max · 92.3 (Alibaba, China) Gemini 3.5 Flash · 92.2 (Google, United States) GPT-5.4 · 92 (OpenAI, United States) GPT-5.4 High · 92 (OpenAI, United States) Claude Opus 4.7 · 91.4 (Anthropic, United States) Claude Opus 4.7 Thinking · 91.4 (Anthropic, United States) Kimi K2.6 · 91.1 (Moonshot AI, China) Gemini 3 Pro · 90.8 (Google, United States) GPT-5.2 · 90.3 (OpenAI, United States) GPT-5.2 High · 90.3 (OpenAI, United States) Grok 4.3 · 90.1 (xAI, United States) Grok 4.3 beta · 90.1 (xAI, United States) Gemini 3 Flash (Thinking Minimal) · 89.8 (Google, United States) Claude Opus 4.6 (Thinking) · 89.6 (Anthropic, United States) DeepSeek-V4-Flash · 89.4 (DeepSeek, China) Qwen3.5-397B-A17B · 89.3 (Alibaba, China) DeepSeek-V4-Pro · 88.8 (DeepSeek, China) Qwen3.6 Max Preview · 88.8 (Alibaba, China) Grok 4.20 Beta 0309 Reasoning · 88.5 (xAI, United States) Qwen3.6-Plus · 88.2 (Alibaba, China) Kimi K2.5 · 87.9 (Moonshot AI, China) Grok 4 (0709) · 87.7 (xAI, United States) GPT-5.4 Mini High · 87.5 (OpenAI, United States) MiniMax M2.7 · 87.4 (MiniMax, China) GPT-5.1 · 87.3 (OpenAI, United States) GPT-5.1 High · 87.3 (OpenAI, United States) MiMo v2 Pro · 87 (Xiaomi, China) GLM-5.1 · 86.8 (Z.ai, China) Claude Opus 4.5 (Thinking 32K) · 86.6 (Anthropic, United States) MiMo-V2.5-Pro · 86.6 (Xiaomi, China) GLM-4.7 · 85.9 (Z.ai, China) Qwen3.5-27B · 85.8 (Alibaba, China) Qwen3.5-122B-A10B · 85.7 (Alibaba, China) GPT-5 High · 85.4 (OpenAI, United States) Grok 4.1 Fast Reasoning · 85.3 (xAI, United States) MiMo-V2.5 · 84.9 (Xiaomi, China) minimax-m2.5 · 84.8 (MiniMax, China) Grok 4 Fast Reasoning · 84.7 (xAI, United States) Qwen3.5-35B-A3B · 84.5 (Alibaba, China) Gemini 2.5 Pro · 84.4 (Google, United States) Qwen3.6-27B · 84.2 (Alibaba, China) Qwen3.6 35B-A3B · 84.1 (Alibaba, China) Claude Opus 4.6 · 84 (Anthropic, United States) Kimi K2 Thinking · 83.8 (Moonshot AI,
China) Claude Sonnet 4.5 (Thinking 32K) · 83.4 (Anthropic, United States) GPT-5 Mini High · 82.8 (OpenAI, United States) OpenAI o3 · 82.7 (OpenAI, United States) Gemini 3.1 Flash Lite Preview · 82.2 (Google, United States) GLM-5 · 82 (Z.ai, China) GPT-5.4 Nano High · 81.7 (OpenAI, United States) DeepSeek-R1 · 81.3 (DeepSeek, China) Gemini 3 Flash · 81.2 (Google, United States) Claude Opus 4.5 · 81 (Anthropic, United States) Claude Opus 4.1 (Thinking 16K) · 80.9 (Anthropic, United States) Qwen3.5-9B · 80.6 (Alibaba, China) Nvidia Nemotron 3 Super · 80 (NVIDIA, United States) Claude Sonnet 4.6 · 79.9 (Anthropic, United States) Claude Opus 4 (Thinking 16K) · 79.6 (Anthropic, United States) Grok 3 Mini High · 79.1 (xAI, United States) OpenAI o4-mini · 78.4 (OpenAI, United States) GLM-4.5 · 78.2 (Z.ai, China) Claude Sonnet 4 (Thinking 32K) · 77.7 (Anthropic, United States) OpenAI o3-mini High · 77.3 (OpenAI, United States) Claude 3.7 Sonnet (Thinking 32K) · 77.2 (Anthropic, United States) Kimi K2 Instruct 0905 · 76.7 (Moonshot AI, China) Gemini 2.5 Flash Preview 09-2025 · 76.6 (Google, United States) Kimi K2 Instruct · 76.6 (Moonshot AI, China) Qwen3 Max (2025-09-23) · 76.4 (Alibaba, China) Qwen3 Max Preview · 76.4 (Alibaba, China) Gemma 4 31B IT · 76.3 (Google, United States) DeepSeek-V3.2 · 75.1 (DeepSeek, China) OpenAI o3-mini · 74.8 (OpenAI, United States) OpenAI o1 · 74.7 (OpenAI, United States) DeepSeek-V3.1 · 73.5 (DeepSeek, China) Claude Sonnet 4.5 · 72.7 (Anthropic, United States) Gemma 4 26B-A4B IT · 71.4 (Google, United States) GPT-5.2 Chat Latest · 71.2 (OpenAI, United States) Claude Opus 4 · 70.1 (Anthropic, United States) Grok 3 Preview 02-24 · 69.3 (xAI, United States) GPT-5 Chat · 68.6 (OpenAI, United States) Claude Sonnet 4 · 68.3 (Anthropic, United States) Gemini 2.5 Flash · 68.3 (Google, United States) GPT-5 Nano High · 67.6 (OpenAI, United States) Llama 4 Maverick · 67.1 (Meta, United States) GPT-4.1 · 66.6 (OpenAI, United States) GPT-4.1 Mini · 66.4 
(OpenAI, United States) ChatGPT-4o Latest (2025-03-26) · 65.5 (OpenAI, United States) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 65.1 (Google, United States) Claude Haiku 4.5 · 64.6 (Anthropic, United States) Amazon Nova 2 Pro · 63.6 (Amazon, United States) GLM-4.6 · 63.2 (Z.ai, China) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 62.5 (Google, United States) Gemini 2.0 Flash · 62.3 (Google, United States) Qwen3-235B-A22B · 61.3 (Alibaba, China) Grok 4 Fast Chat · 60.6 (xAI, United States) Amazon Nova 2 Lite · 60.3 (Amazon, United States) OpenAI o1-mini · 60.3 (OpenAI, United States) Llama 4 Scout · 58.7 (Meta, United States) Qwen2.5 Max · 58.7 (Alibaba, China) DeepSeek-V3 · 55.7 (DeepSeek, China) Gemma 4 E4B IT · 54.9 (Google, United States) GPT-4o · 54.3 (OpenAI, United States) Gemini 2.0 Flash Lite Preview · 54.2 (Google, United States) Qwen3-32B · 53.5 (Alibaba, China) Llama 3.1 405B Instruct · 51.5 (Meta, United States) Qwen3-30B-A3B · 51.5 (Alibaba, China) Grok 2 · 51 (xAI, United States) Llama 3.3 70B Instruct · 49.8 (Meta, United States) Nemotron 3 Nano Omni · 46.9 (NVIDIA, United States) Gemma 3 27B IT · 42.8 (Google, United States) Llama 3.1 70B Instruct · 40.9 (Meta, United States) Gemma 4 E2B IT · 40.5 (Google, United States) Llama 3 70B Instruct · 37.9 (Meta, United States) Llama 2 70B Chat · 32.7 (Meta, United States) Llama 2 13B Chat · 32.1 (Meta, United States) Llama 3 8B Instruct · 29.6 (Meta, United States) Gemma 3 4B IT · 29.1 (Google, United States) Llama 3.1 8B Instruct · 25.9 (Meta, United States) Llama 2 7B Chat · 22.7 (Meta, United States)

MMLU-PRO (score axis: 0–100)
Gemini 3 Pro · 89.8 (Google, United States) Claude Opus 4.5 (Thinking 32K) · 89.5 (Anthropic, United States) Gemini 3 Flash (Thinking Minimal) · 89 (Google, United States) Claude Opus 4.5 · 88.9 (Anthropic, United States) Gemini 3 Flash · 88.2 (Google, United States) Claude Opus 4.1 (Thinking 16K) · 88 (Anthropic, United States) Qwen3.5-397B-A17B · 87.8 (Alibaba, China) Claude
Sonnet 4.5 (Thinking 32K) · 87.5 (Anthropic, United States) DeepSeek-V4-Pro · 87.5 (DeepSeek, China) GPT-5.2 · 87.4 (OpenAI, United States) GPT-5.2 High · 87.4 (OpenAI, United States) Claude Opus 4 (Thinking 16K) · 87.3 (Anthropic, United States) GPT-5 High · 87.1 (OpenAI, United States) Kimi K2.5 · 87.1 (Moonshot AI, China) GPT-5.1 · 87 (OpenAI, United States) GPT-5.1 High · 87 (OpenAI, United States) Qwen3.5-122B-A10B · 86.7 (Alibaba, China) Grok 4 (0709) · 86.6 (xAI, United States) DeepSeek-V4-Flash · 86.4 (DeepSeek, China) Gemini 2.5 Pro · 86.2 (Google, United States) Qwen3.6-27B · 86.2 (Alibaba, China) Qwen3.5-27B · 86.1 (Alibaba, China) Claude Opus 4 · 86 (Anthropic, United States) Claude Sonnet 4.5 · 86 (Anthropic, United States) GLM-5 · 86 (Z.ai, China) GLM-4.7 · 85.6 (Z.ai, China) Grok 4.1 Fast Reasoning · 85.4 (xAI, United States) OpenAI o3 · 85.3 (OpenAI, United States) Qwen3.6 35B-A3B · 85.2 (Alibaba, China) Grok 4 Fast Reasoning · 85 (xAI, United States) DeepSeek-R1 · 84.9 (DeepSeek, China) Kimi K2 Thinking · 84.8 (Moonshot AI, China) Claude Sonnet 4 (Thinking 32K) · 84.2 (Anthropic, United States) OpenAI o1 · 84.1 (OpenAI, United States) Qwen3 Max (2025-09-23) · 84.1 (Alibaba, China) Qwen3 Max Preview · 83.8 (Alibaba, China) Nvidia Nemotron 3 Super · 83.7 (NVIDIA, United States) Claude 3.7 Sonnet (Thinking 32K) · 83.7 (Anthropic, United States) Claude Sonnet 4 · 83.7 (Anthropic, United States) DeepSeek-V3.2 · 83.7 (DeepSeek, China) GPT-5 Mini High · 83.7 (OpenAI, United States) Gemini 2.5 Flash Preview 09-2025 · 83.6 (Google, United States) GLM-4.5 · 83.5 (Z.ai, China) DeepSeek-V3.1 · 83.3 (DeepSeek, China) OpenAI o4-mini · 83.2 (OpenAI, United States) Grok 3 Mini High · 82.8 (xAI, United States) Kimi K2 Instruct · 82.4 (Moonshot AI, China) GPT-5 Chat · 82 (OpenAI, United States) Kimi K2 Instruct 0905 · 81.9 (Moonshot AI, China) GPT-5.2 Chat Latest · 81.4 (OpenAI, United States) Gemini 2.5 Flash · 80.9 (Google, United States) Llama 4 Maverick · 80.9 
(Meta, United States) GPT-4.1 · 80.6 (OpenAI, United States) ChatGPT-4o Latest (2025-03-26) · 80.3 (OpenAI, United States) OpenAI o3-mini High · 80.2 (OpenAI, United States) minimax-m2.5 · 80.1 (MiniMax, China) Claude Haiku 4.5 · 80 (Anthropic, United States) Grok 3 Preview 02-24 · 79.9 (xAI, United States) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 79.6 (Google, United States) OpenAI o3-mini · 79.1 (OpenAI, United States) GLM-4.6 · 78.4 (Z.ai, China) GPT-4.1 Mini · 78.1 (OpenAI, United States) GPT-5 Nano High · 78 (OpenAI, United States) Gemini 2.0 Flash · 77.9 (Google, United States) Amazon Nova 2 Pro · 77.2 (Amazon, United States) Qwen2.5 Max · 76.2 (Alibaba, China) Qwen3-235B-A22B · 76.2 (Alibaba, China) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 75.9 (Google, United States) DeepSeek-V3 · 75.2 (DeepSeek, China) Llama 4 Scout · 75.2 (Meta, United States) GPT-4o · 74.8 (OpenAI, United States) Amazon Nova 2 Lite · 74.3 (Amazon, United States) OpenAI o1-mini · 74.2 (OpenAI, United States) Llama 3.1 405B Instruct · 73.2 (Meta, United States) Grok 4 Fast Chat · 73 (xAI, United States) Qwen3-32B · 72.7 (Alibaba, China) Llama 3.3 70B Instruct · 71.3 (Meta, United States) Qwen3-30B-A3B · 71 (Alibaba, China) Grok 2 · 70.9 (xAI, United States) Llama 3.1 70B Instruct · 67.6 (Meta, United States) Gemma 3 27B IT · 66.9 (Google, United States) Llama 3 70B Instruct · 57.4 (Meta, United States) Llama 3.1 8B Instruct · 47.6 (Meta, United States) Gemma 3 4B IT · 41.7 (Google, United States) Llama 2 13B Chat · 40.6 (Meta, United States) Llama 2 70B Chat · 40.6 (Meta, United States) Llama 3 8B Instruct · 40.5 (Meta, United States) Llama 2 7B Chat · 16.4 (Meta, United States)

GSM8K (score axis: 91–96)
DeepSeek-V3 · 94.3 (DeepSeek, China) DeepSeek-V4-Pro · 92.6 (DeepSeek, China)

SWE-Verified (score axis: 50–85)
DeepSeek-V4-Pro · 80.6 (DeepSeek, China) Kimi K2.6 · 80.2 (Moonshot AI, China) DeepSeek-V4-Flash · 79 (DeepSeek, China) Qwen3.6-27B · 77.2 (Alibaba, China) Qwen3.5-397B-A17B · 76.4
(Alibaba, China) minimax-m2.5 · 75.8 (MiniMax, China) GLM-4.7 · 73.8 (Z.ai, China) Qwen3.6 35B-A3B · 73.4 (Alibaba, China) GLM-5 · 72.8 (Z.ai, China) Qwen3.5-27B · 72.4 (Alibaba, China) Qwen3.5-122B-A10B · 72 (Alibaba, China) Kimi K2 Thinking · 71.3 (Moonshot AI, China) Kimi K2.5 · 70.8 (Moonshot AI, China) Nvidia Nemotron 3 Super · 53.7 (NVIDIA, United States)

HLE (score axis: 0–50)
Gemini 3.1 Pro Preview · 44.7 (Google, United States) GPT-5.5 · 44.3 (OpenAI, United States) GPT-5.4 · 41.6 (OpenAI, United States) GPT-5.4 High · 41.6 (OpenAI, United States) Gemini 3.5 Flash · 41 (Google, United States) Claude Opus 4.7 · 39.6 (Anthropic, United States) Claude Opus 4.7 Thinking · 39.6 (Anthropic, United States) Qwen 3.7 Max · 38.1 (Alibaba, China) Gemini 3 Pro · 37.2 (Google, United States) Claude Opus 4.6 (Thinking) · 36.7 (Anthropic, United States) DeepSeek-V4-Pro · 35.9 (DeepSeek, China) Kimi K2.6 · 35.9 (Moonshot AI, China) GPT-5.2 · 35.4 (OpenAI, United States) GPT-5.2 High · 35.4 (OpenAI, United States) Grok 4.3 · 35 (xAI, United States) Grok 4.3 beta · 35 (xAI, United States) Gemini 3 Flash (Thinking Minimal) · 34.7 (Google, United States) MiMo-V2.5-Pro · 33.8 (Xiaomi, China) DeepSeek-V4-Flash · 32.1 (DeepSeek, China) Grok 4.20 Beta 0309 Reasoning · 30 (xAI, United States) Kimi K2.5 · 29.4 (Moonshot AI, China) Qwen3.6 Max Preview · 28.9 (Alibaba, China) Claude Opus 4.5 (Thinking 32K) · 28.4 (Anthropic, United States) MiMo v2 Pro · 28.3 (Xiaomi, China) MiniMax M2.7 · 28.1 (MiniMax, China) GLM-5.1 · 28 (Z.ai, China) Qwen3.5-397B-A17B · 27.3 (Alibaba, China) GLM-5 · 27.2 (Z.ai, China) GPT-5.4 Mini High · 26.6 (OpenAI, United States) GPT-5 High · 26.5 (OpenAI, United States) GPT-5.1 · 26.5 (OpenAI, United States) GPT-5.1 High · 26.5 (OpenAI, United States) GPT-5.4 Nano High · 26.5 (OpenAI, United States) Qwen3.6-Plus · 25.7 (Alibaba, China) MiMo-V2.5 · 25.2 (Xiaomi, China) GLM-4.7 · 25.1 (Z.ai, China) Grok 4 (0709) · 23.9 (xAI, United States) Qwen3.5-122B-A10B · 23.4 (Alibaba,
China) Kimi K2 Thinking · 22.3 (Moonshot AI, China) Qwen3.5-27B · 22.2 (Alibaba, China) Qwen3.6-27B · 21.6 (Alibaba, China) Gemini 2.5 Pro · 21.1 (Google, United States) Qwen3.6 35B-A3B · 20.2 (Alibaba, China) OpenAI o3 · 20 (OpenAI, United States) GPT-5 Mini High · 19.7 (OpenAI, United States) Qwen3.5-35B-A3B · 19.7 (Alibaba, China) Nvidia Nemotron 3 Super · 19.2 (NVIDIA, United States) minimax-m2.5 · 19.1 (MiniMax, China) Claude Opus 4.6 · 18.6 (Anthropic, United States) Grok 4.1 Fast Reasoning · 17.6 (xAI, United States) OpenAI o4-mini · 17.5 (OpenAI, United States) Claude Sonnet 4.5 (Thinking 32K) · 17.3 (Anthropic, United States) Grok 4 Fast Reasoning · 17 (xAI, United States) Gemini 3.1 Flash Lite Preview · 16.2 (Google, United States) DeepSeek-R1 · 14.9 (DeepSeek, China) Gemini 3 Flash · 14.1 (Google, United States) Qwen3.5-9B · 13.3 (Alibaba, China) Claude Sonnet 4.6 · 13.2 (Anthropic, United States) Claude Opus 4.5 · 12.9 (Anthropic, United States) OpenAI o3-mini High · 12.3 (OpenAI, United States) GLM-4.5 · 12.2 (Z.ai, China) Claude Opus 4.1 (Thinking 16K) · 11.9 (Anthropic, United States) Claude Opus 4 (Thinking 16K) · 11.7 (Anthropic, United States) Gemma 4 31B IT · 11.5 (Google, United States) Grok 3 Mini High · 11.1 (xAI, United States) Qwen3 Max (2025-09-23) · 11.1 (Alibaba, China) Gemma 4 26B-A4B IT · 10.7 (Google, United States) DeepSeek-V3.2 · 10.5 (DeepSeek, China) Claude 3.7 Sonnet (Thinking 32K) · 10.3 (Anthropic, United States) Claude Sonnet 4 (Thinking 32K) · 9.6 (Anthropic, United States) Qwen3 Max Preview · 9.3 (Alibaba, China) OpenAI o3-mini · 8.7 (OpenAI, United States) GPT-5 Nano High · 8.2 (OpenAI, United States) Gemini 2.5 Flash Preview 09-2025 · 7.8 (Google, United States) OpenAI o1 · 7.7 (OpenAI, United States) GPT-5.2 Chat Latest · 7.3 (OpenAI, United States) Claude Sonnet 4.5 · 7.1 (Anthropic, United States) Kimi K2 Instruct · 7 (Moonshot AI, China) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 6.4 (Google, United States) 
DeepSeek-V3.1 · 6.3 (DeepSeek, China) Kimi K2 Instruct 0905 · 6.3 (Moonshot AI, China) Claude Opus 4 · 5.9 (Anthropic, United States) GPT-5 Chat · 5.8 (OpenAI, United States) Llama 2 7B Chat · 5.8 (Meta, United States) Gemini 2.0 Flash · 5.3 (Google, United States) Nemotron 3 Nano Omni · 5.3 (NVIDIA, United States) Gemma 3 4B IT · 5.2 (Google, United States) GLM-4.6 · 5.2 (Z.ai, China) Gemini 2.5 Flash · 5.1 (Google, United States) Grok 3 Preview 02-24 · 5.1 (xAI, United States) Llama 3 8B Instruct · 5.1 (Meta, United States) Llama 3.1 8B Instruct · 5.1 (Meta, United States) ChatGPT-4o Latest (2025-03-26) · 5 (OpenAI, United States) Grok 4 Fast Chat · 5 (xAI, United States) Llama 2 70B Chat · 5 (Meta, United States) OpenAI o1-mini · 4.9 (OpenAI, United States) Llama 4 Maverick · 4.8 (Meta, United States) Gemma 3 27B IT · 4.7 (Google, United States) Gemma 4 E4B IT · 4.7 (Google, United States) Llama 2 13B Chat · 4.7 (Meta, United States) Qwen3-235B-A22B · 4.7 (Alibaba, China) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 4.6 (Google, United States) GPT-4.1 · 4.6 (OpenAI, United States) GPT-4.1 Mini · 4.6 (OpenAI, United States) Llama 3.1 70B Instruct · 4.6 (Meta, United States) Qwen3-30B-A3B · 4.6 (Alibaba, China) Gemma 4 E2B IT · 4.5 (Google, United States) Qwen2.5 Max · 4.5 (Alibaba, China) Gemini 2.0 Flash Lite Preview · 4.4 (Google, United States) Llama 3 70B Instruct · 4.4 (Meta, United States) Claude Haiku 4.5 · 4.3 (Anthropic, United States) Llama 4 Scout · 4.3 (Meta, United States) Qwen3-32B · 4.3 (Alibaba, China) Llama 3.1 405B Instruct · 4.2 (Meta, United States) Amazon Nova 2 Pro · 4 (Amazon, United States) Claude Sonnet 4 · 4 (Anthropic, United States) Llama 3.3 70B Instruct · 4 (Meta, United States) Grok 2 · 3.8 (xAI, United States) DeepSeek-V3 · 3.6 (DeepSeek, China) GPT-4o · 3.3 (OpenAI, United States) Amazon Nova 2 Lite · 3 (Amazon, United States)

AIME 2026 (score axis: 88–98)
Kimi K2.6 · 96.4 (Moonshot AI, China) GLM-5 · 95.8 (Z.ai, China) Kimi K2.5 ·
95.8 (Moonshot AI, China) GLM-5.1 · 95.3 (Z.ai, China) Qwen3.6-27B · 94.1 (Alibaba, China) Qwen3.5-397B-A17B · 93.3 (Alibaba, China) Qwen3.6 35B-A3B · 92.7 (Alibaba, China) Qwen3.5-27B · 90.8 (Alibaba, China) Nvidia Nemotron 3 Super · 90 (NVIDIA, United States)

Terminal Bench (score axis: 20–80)
DeepSeek-V4-Pro · 67.9 (DeepSeek, China) Kimi K2.6 · 66.7 (Moonshot AI, China) GLM-5.1 · 63.5 (Z.ai, China) Qwen3.6-27B · 59.3 (Alibaba, China) DeepSeek-V4-Flash · 56.9 (DeepSeek, China) Qwen3.5-397B-A17B · 52.5 (Alibaba, China) GLM-5 · 52.4 (Z.ai, China) Qwen3.6 35B-A3B · 51.5 (Alibaba, China) Qwen3.5-122B-A10B · 49.4 (Alibaba, China) Kimi K2.5 · 43.2 (Moonshot AI, China) Qwen3.5-27B · 41.6 (Alibaba, China) Kimi K2 Thinking · 35.7 (Moonshot AI, China) GLM-4.7 · 33.4 (Z.ai, China) Nvidia Nemotron 3 Super · 31 (NVIDIA, United States) Kimi K2 Instruct · 27.8 (Moonshot AI, China) GLM-4.6 · 24.5 (Z.ai, China)

SWE-Pro (score axis: 0–70)
Kimi K2.6 · 58.6 (Moonshot AI, China) GLM-5.1 · 58.4 (Z.ai, China) DeepSeek-V4-Pro · 55.4 (DeepSeek, China) minimax-m2.5 · 55.4 (MiniMax, China) Qwen3.6-27B · 53.5 (Alibaba, China) Kimi K2.5 · 50.7 (Moonshot AI, China) Qwen3.6 35B-A3B · 49.5 (Alibaba, China) Kimi K2 Instruct · 27.7 (Moonshot AI, China) Qwen3-235B-A22B · 21.4 (Alibaba, China) GLM-4.6 · 9.7 (Z.ai, China)

EvasionBench (score axis: 65–85)
GLM-4.7 · 82.9 (Z.ai, China) Kimi K2 Instruct 0905 · 66.7 (Moonshot AI, China)

HMMT 2026 (score axis: 75–95)
Kimi K2.6 · 92.7 (Moonshot AI, China) Qwen3.5-397B-A17B · 87.9 (Alibaba, China) Kimi K2.5 · 87.1 (Moonshot AI, China) GLM-5 · 86.4 (Z.ai, China) Nvidia Nemotron 3 Super · 84.8 (NVIDIA, United States) Qwen3.6-27B · 84.3 (Alibaba, China) Qwen3.6 35B-A3B · 83.6 (Alibaba, China) GLM-5.1 · 82.6 (Z.ai, China) Qwen3.5-27B · 81.1 (Alibaba, China)

AA Intelligence Index (score axis: 0–70)
GPT-5.5 · 60.2 (OpenAI, United States) Claude Opus 4.7 · 57.3 (Anthropic, United States) Claude Opus 4.7 Thinking · 57.3 (Anthropic, United States) Gemini 3.1 Pro Preview · 57.2 (Google, United States) GPT-5.4 · 56.8 (OpenAI, United
States) GPT-5.4 High · 56.8 (OpenAI, United States) Qwen 3.7 Max · 56.6 (Alibaba, China) Gemini 3.5 Flash · 55.3 (Google, United States) Kimi K2.6 · 53.9 (Moonshot AI, China) MiMo-V2.5-Pro · 53.8 (Xiaomi, China) Grok 4.3 · 53.2 (xAI, United States) Grok 4.3 beta · 53.2 (xAI, United States) Claude Opus 4.6 (Thinking) · 52.9 (Anthropic, United States) Qwen3.6 Max Preview · 51.8 (Alibaba, China) DeepSeek-V4-Pro · 51.5 (DeepSeek, China) GLM-5.1 · 51.4 (Z.ai, China) GPT-5.2 · 51.3 (OpenAI, United States) GPT-5.2 High · 51.3 (OpenAI, United States) Qwen3.6-Plus · 50 (Alibaba, China) GLM-5 · 49.8 (Z.ai, China) Claude Opus 4.5 (Thinking 32K) · 49.7 (Anthropic, United States) MiniMax M2.7 · 49.6 (MiniMax, China) MiMo v2 Pro · 49.2 (Xiaomi, China) MiMo-V2.5 · 49 (Xiaomi, China) GPT-5.4 Mini High · 48.9 (OpenAI, United States) Grok 4.20 Beta 0309 Reasoning · 48.5 (xAI, United States) Gemini 3 Pro · 48.4 (Google, United States) GPT-5.1 · 47.7 (OpenAI, United States) GPT-5.1 High · 47.7 (OpenAI, United States) Kimi K2.5 · 46.8 (Moonshot AI, China) Claude Opus 4.6 · 46.5 (Anthropic, United States) DeepSeek-V4-Flash · 46.5 (DeepSeek, China) Gemini 3 Flash (Thinking Minimal) · 46.4 (Google, United States) Qwen3.6-27B · 45.8 (Alibaba, China) Qwen3.5-397B-A17B · 45 (Alibaba, China) GPT-5 High · 44.6 (OpenAI, United States) Claude Sonnet 4.6 · 44.4 (Anthropic, United States) GPT-5.4 Nano High · 44 (OpenAI, United States) Qwen3.6 35B-A3B · 43.5 (Alibaba, China) Claude Opus 4.5 · 43.1 (Anthropic, United States) Claude Sonnet 4.5 (Thinking 32K) · 43 (Anthropic, United States) GLM-4.7 · 42.1 (Z.ai, China) Qwen3.5-27B · 42.1 (Alibaba, China) Claude Opus 4.1 (Thinking 16K) · 42 (Anthropic, United States) minimax-m2.5 · 41.9 (MiniMax, China) Qwen3.5-122B-A10B · 41.6 (Alibaba, China) Grok 4 (0709) · 41.5 (xAI, United States) GPT-5 Mini High · 41.2 (OpenAI, United States) Kimi K2 Thinking · 40.9 (Moonshot AI, China) Claude Opus 4 (Thinking 16K) · 39 (Anthropic, United States) Claude Sonnet 4 
(Thinking 32K) · 38.7 (Anthropic, United States) Grok 4.1 Fast Reasoning · 38.6 (xAI, United States) OpenAI o3 · 38.4 (OpenAI, United States) Claude Sonnet 4.5 · 37.1 (Anthropic, United States) Qwen3.5-35B-A3B · 37.1 (Alibaba, China) Claude Opus 4.1 · 36 (Anthropic, United States) Nvidia Nemotron 3 Super · 36 (NVIDIA, United States) Grok 4 Fast Reasoning · 35.1 (xAI, United States) Gemini 3 Flash · 35 (Google, United States) Claude 3.7 Sonnet (Thinking 32K) · 34.7 (Anthropic, United States) Gemini 2.5 Pro · 34.6 (Google, United States) GPT-5.2 Chat Latest · 33.6 (OpenAI, United States) Gemini 3.1 Flash Lite Preview · 33.5 (Google, United States) OpenAI o4-mini · 33.1 (OpenAI, United States) Claude Opus 4 · 33 (Anthropic, United States) Claude Sonnet 4 · 33 (Anthropic, United States) Qwen3.5-9B · 32.4 (Alibaba, China) Gemma 4 31B IT · 32.3 (Google, United States) DeepSeek-V3.2 · 32.1 (DeepSeek, China) Grok 3 Mini High · 32.1 (xAI, United States) Qwen3 Max (2025-09-23) · 31.4 (Alibaba, China) Claude Haiku 4.5 · 31 (Anthropic, United States) Kimi K2 Instruct 0905 · 30.9 (Moonshot AI, China) OpenAI o1 · 30.7 (OpenAI, United States) GLM-4.6 · 30.2 (Z.ai, China) DeepSeek-V3.1 · 28.1 (DeepSeek, China) DeepSeek-R1 · 27.1 (DeepSeek, China) Gemma 4 26B-A4B IT · 27.1 (Google, United States) GPT-5 Nano High · 26.8 (OpenAI, United States) GLM-4.5 · 26.4 (Z.ai, China) GPT-4.1 · 26.3 (OpenAI, United States) Kimi K2 Instruct · 26.3 (Moonshot AI, China) Qwen3 Max Preview · 26.1 (Alibaba, China) OpenAI o3-mini · 25.9 (OpenAI, United States) Gemini 2.5 Flash Preview 09-2025 · 25.7 (Google, United States) Grok 3 Preview 02-24 · 25.2 (xAI, United States) OpenAI o3-mini High · 25.2 (OpenAI, United States) OpenAI o1 Preview · 23.7 (OpenAI, United States) Amazon Nova 2 Pro · 23.1 (Amazon, United States) Grok 4 Fast Chat · 23.1 (xAI, United States) GPT-4.1 Mini · 22.9 (OpenAI, United States) GPT-5 Chat · 21.8 (OpenAI, United States) Nemotron 3 Nano Omni · 21.4 (NVIDIA, United States) 
Gemini 2.5 Flash · 20.6 (Google, United States) OpenAI o1-mini · 20.4 (OpenAI, United States) GPT-4.5 Preview · 20 (OpenAI, United States) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 19.4 (Google, United States) ChatGPT-4o Latest (2025-03-26) · 18.6 (OpenAI, United States) Gemini 2.0 Flash · 18.5 (Google, United States) Llama 4 Maverick · 18.4 (Meta, United States) Amazon Nova 2 Lite · 18 (Amazon, United States) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 17.6 (Google, United States) Llama 3.1 405B Instruct · 17.4 (Meta, United States) GPT-4o · 17.3 (OpenAI, United States) Qwen3-235B-A22B · 17 (Alibaba, China) DeepSeek-V3 · 16.5 (DeepSeek, China) Qwen2.5 Max · 16.3 (Alibaba, China) Gemma 4 E4B IT · 14.8 (Google, United States) Gemini 2.0 Flash Lite Preview · 14.5 (Google, United States) Llama 3.3 70B Instruct · 14.5 (Meta, United States) Qwen3-32B · 14.5 (Alibaba, China) Grok 2 · 13.9 (xAI, United States) Llama 4 Scout · 13.5 (Meta, United States) Llama 3.1 70B Instruct · 12.5 (Meta, United States) Qwen3-30B-A3B · 12.5 (Alibaba, China) Gemma 4 E2B IT · 12.1 (Google, United States) Llama 3.1 8B Instruct · 11.8 (Meta, United States) Gemma 3 27B IT · 10.3 (Google, United States) Llama 2 7B Chat · 9.7 (Meta, United States) Llama 3 70B Instruct · 8.9 (Meta, United States) Llama 2 13B Chat · 8.4 (Meta, United States) Llama 2 70B Chat · 8.4 (Meta, United States) LLaMA 65B · 7.4 (Meta, United States) Llama 3 8B Instruct · 6.4 (Meta, United States) Gemma 3 4B IT · 6.3 (Google, United States)

MATH-500 (score axis: 0–100)
GPT-5 High · 99.4 (OpenAI, United States) Grok 3 Mini High · 99.2 (xAI, United States) OpenAI o3 · 99.2 (OpenAI, United States) Claude Sonnet 4 (Thinking 32K) · 99.1 (Anthropic, United States) Grok 4 (0709) · 99 (xAI, United States) OpenAI o4-mini · 98.9 (OpenAI, United States) OpenAI o3-mini High · 98.5 (OpenAI, United States) DeepSeek-R1 · 98.3 (DeepSeek, China) Claude Opus 4 (Thinking 16K) · 98.2 (Anthropic, United States) GLM-4.5 · 97.9 (Z.ai, China)
OpenAI o3-mini · 97.3 (OpenAI, United States) Kimi K2 Instruct · 97.1 (Moonshot AI, China) OpenAI o1 · 97 (OpenAI, United States) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 96.9 (Google, United States) Gemini 2.5 Pro · 96.7 (Google, United States) Claude 3.7 Sonnet (Thinking 32K) · 94.7 (Anthropic, United States) OpenAI o1-mini · 94.4 (OpenAI, United States) Claude Opus 4 · 94.1 (Anthropic, United States) Claude Sonnet 4 · 93.4 (Anthropic, United States) Gemini 2.5 Flash · 93.2 (Google, United States) Gemini 2.0 Flash · 93 (Google, United States) GPT-4.1 Mini · 92.5 (OpenAI, United States) OpenAI o1 Preview · 92.4 (OpenAI, United States) GPT-4.1 · 91.3 (OpenAI, United States) Qwen3-235B-A22B · 90.2 (Alibaba, China) ChatGPT-4o Latest (2025-03-26) · 89.3 (OpenAI, United States) Llama 4 Maverick · 88.9 (Meta, United States) DeepSeek-V3 · 88.7 (DeepSeek, China) Gemma 3 27B IT · 88.3 (Google, United States) Gemini 2.0 Flash Lite Preview · 87.3 (Google, United States) Grok 3 Preview 02-24 · 87 (xAI, United States) Qwen3-32B · 86.9 (Alibaba, China) Qwen3-30B-A3B · 86.3 (Alibaba, China) Llama 4 Scout · 84.4 (Meta, United States) Qwen2.5 Max · 83.5 (Alibaba, China) Grok 2 · 77.8 (xAI, United States) Llama 3.3 70B Instruct · 77.3 (Meta, United States) Gemma 3 4B IT · 76.6 (Google, United States) GPT-4o · 75.9 (OpenAI, United States) Llama 3.1 405B Instruct · 70.3 (Meta, United States) Llama 3.1 70B Instruct · 64.9 (Meta, United States) Llama 3.1 8B Instruct · 51.9 (Meta, United States) Llama 3 8B Instruct · 49.9 (Meta, United States) Llama 3 70B Instruct · 48.3 (Meta, United States) Llama 2 13B Chat · 32.9 (Meta, United States) Llama 2 70B Chat · 32.3 (Meta, United States) Llama 2 7B Chat · 5.9 (Meta, United States)

LiveCodeBench (score axis: 0–100)
Gemini 3 Pro · 91.7 (Google, United States) Gemini 3 Flash (Thinking Minimal) · 90.8 (Google, United States) GLM-4.7 · 89.4 (Z.ai, China) GPT-5.2 · 88.9 (OpenAI, United States) GPT-5.2 High · 88.9 (OpenAI, United States) Claude Opus 4.5
(Thinking 32K) · 87.1 (Anthropic, United States) GPT-5.1 · 86.8 (OpenAI, United States) GPT-5.1 High · 86.8 (OpenAI, United States) OpenAI o4-mini · 85.9 (OpenAI, United States) Kimi K2 Thinking · 85.3 (Moonshot AI, China) GPT-5 High · 84.6 (OpenAI, United States) GPT-5 Mini High · 83.8 (OpenAI, United States) Grok 4 Fast Reasoning · 83.2 (xAI, United States) Grok 4.1 Fast Reasoning · 82.2 (xAI, United States) Grok 4 (0709) · 81.9 (xAI, United States) OpenAI o3 · 80.8 (OpenAI, United States) Gemini 2.5 Pro · 80.1 (Google, United States) Gemini 3 Flash · 79.7 (Google, United States) GPT-5 Nano High · 78.9 (OpenAI, United States) DeepSeek-R1 · 77 (DeepSeek, China) Qwen3 Max (2025-09-23) · 76.7 (Alibaba, China) Claude Opus 4.5 · 73.8 (Anthropic, United States) GLM-4.5 · 73.8 (Z.ai, China) OpenAI o3-mini High · 73.4 (OpenAI, United States) OpenAI o3-mini · 71.7 (OpenAI, United States) Claude Sonnet 4.5 (Thinking 32K) · 71.4 (Anthropic, United States) Grok 3 Mini High · 69.6 (xAI, United States) OpenAI o1 · 67.9 (OpenAI, United States) GPT-5.2 Chat Latest · 66.9 (OpenAI, United States) Claude Sonnet 4 (Thinking 32K) · 65.5 (Anthropic, United States) Claude Opus 4.1 (Thinking 16K) · 65.4 (Anthropic, United States) Qwen3 Max Preview · 65.1 (Alibaba, China) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 64.1 (Google, United States) Claude Opus 4 (Thinking 16K) · 63.6 (Anthropic, United States) Gemini 2.5 Flash Preview 09-2025 · 62.5 (Google, United States) Kimi K2 Instruct 0905 · 61 (Moonshot AI, China) DeepSeek-V3.2 · 59.3 (DeepSeek, China) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 59.3 (Google, United States) Claude Sonnet 4.5 · 59 (Anthropic, United States) DeepSeek-V3.1 · 57.7 (DeepSeek, China) OpenAI o1-mini · 57.6 (OpenAI, United States) GLM-4.6 · 56.1 (Z.ai, China) Kimi K2 Instruct · 55.6 (Moonshot AI, China) GPT-5 Chat · 54.3 (OpenAI, United States) Claude Opus 4 · 54.2 (Anthropic, United States) Claude Haiku 4.5 · 51.1 (Anthropic, United States) 
Gemini 2.5 Flash · 49.5 (Google, United States) GPT-4.1 Mini · 48.3 (OpenAI, United States) Amazon Nova 2 Pro · 47.3 (Amazon, United States) Claude 3.7 Sonnet (Thinking 32K) · 47.3 (Anthropic, United States) GPT-4.1 · 45.7 (OpenAI, United States) Claude Sonnet 4 · 44.9 (Anthropic, United States) ChatGPT-4o Latest (2025-03-26) · 42.5 (OpenAI, United States) Grok 3 Preview 02-24 · 42.5 (xAI, United States) Grok 4 Fast Chat · 40.1 (xAI, United States) Llama 4 Maverick · 39.7 (Meta, United States) DeepSeek-V3 · 35.9 (DeepSeek, China) Qwen2.5 Max · 35.9 (Alibaba, China) Amazon Nova 2 Lite · 34.6 (Amazon, United States) Qwen3-235B-A22B · 34.3 (Alibaba, China) Gemini 2.0 Flash · 33.4 (Google, United States) Qwen3-30B-A3B · 32.2 (Alibaba, China) GPT-4o · 30.9 (OpenAI, United States) Llama 3.1 405B Instruct · 30.5 (Meta, United States) Llama 4 Scout · 29.9 (Meta, United States) Llama 3.3 70B Instruct · 28.8 (Meta, United States) Qwen3-32B · 28.8 (Alibaba, China) Grok 2 · 26.7 (xAI, United States) Llama 3.1 70B Instruct · 23.2 (Meta, United States) Llama 3 70B Instruct · 19.8 (Meta, United States) Gemini 2.0 Flash Lite Preview · 17.9 (Google, United States) Gemma 3 27B IT · 13.7 (Google, United States) Llama 3.1 8B Instruct · 11.6 (Meta, United States) Gemma 3 4B IT · 11.2 (Google, United States) Llama 2 13B Chat · 9.8 (Meta, United States) Llama 2 70B Chat · 9.8 (Meta, United States) Llama 3 8B Instruct · 9.6 (Meta, United States) Llama 2 7B Chat · 0.2 (Meta, United States)

SciCode (score axis: 0–70)
Gemini 3.1 Pro Preview · 58.9 (Google, United States) GPT-5.4 · 56.6 (OpenAI, United States) GPT-5.4 High · 56.6 (OpenAI, United States) Gemini 3 Pro · 56.1 (Google, United States) GPT-5.5 · 56.1 (OpenAI, United States) Claude Opus 4.7 · 54.5 (Anthropic, United States) Claude Opus 4.7 Thinking · 54.5 (Anthropic, United States) Kimi K2.6 · 53.5 (Moonshot AI, China) Gemini 3.5 Flash · 53.1 (Google, United States) GPT-5.2 · 52.1 (OpenAI, United States) GPT-5.2 High · 52.1 (OpenAI, United
States) Claude Opus 4.6 (Thinking) · 51.9 (Anthropic, United States) Gemini 3 Flash (Thinking Minimal) · 50.6 (Google, United States) MiMo-V2.5-Pro · 50.2 (Xiaomi, China) DeepSeek-V4-Pro · 50 (DeepSeek, China) Gemini 3 Flash · 49.9 (Google, United States) GPT-5.4 Mini High · 49.9 (OpenAI, United States) Claude Opus 4.5 (Thinking 32K) · 49.5 (Anthropic, United States) Kimi K2.5 · 49 (Moonshot AI, China) Qwen 3.7 Max · 48.8 (Alibaba, China) Grok 4.3 · 47.3 (xAI, United States) Grok 4.3 beta · 47.3 (xAI, United States) Claude Opus 4.5 · 47 (Anthropic, United States) MiniMax M2.7 · 47 (MiniMax, China) Claude Sonnet 4.6 · 46.9 (Anthropic, United States) GPT-5.4 Nano High · 46.9 (OpenAI, United States) Qwen3.6 Max Preview · 46.9 (Alibaba, China) OpenAI o4-mini · 46.5 (OpenAI, United States) GLM-5 · 46.2 (Z.ai, China) Claude Opus 4.6 · 45.7 (Anthropic, United States) Grok 4 (0709) · 45.7 (xAI, United States) GLM-4.7 · 45.1 (Z.ai, China) DeepSeek-V4-Flash · 44.9 (DeepSeek, China) Claude Sonnet 4.5 (Thinking 32K) · 44.7 (Anthropic, United States) Grok 4.20 Beta 0309 Reasoning · 44.7 (xAI, United States) Grok 4 Fast Reasoning · 44.2 (xAI, United States) Grok 4.1 Fast Reasoning · 44.2 (xAI, United States) GLM-5.1 · 43.8 (Z.ai, China) GPT-5.1 · 43.3 (OpenAI, United States) GPT-5.1 High · 43.3 (OpenAI, United States) MiMo-V2.5 · 43.1 (Xiaomi, China) GPT-5 High · 42.9 (OpenAI, United States) Claude Sonnet 4.5 · 42.8 (Anthropic, United States) Gemini 2.5 Pro · 42.8 (Google, United States) minimax-m2.5 · 42.6 (MiniMax, China) MiMo v2 Pro · 42.5 (Xiaomi, China) Kimi K2 Thinking · 42.4 (Moonshot AI, China) Qwen3.5-122B-A10B · 42 (Alibaba, China) Qwen3.5-397B-A17B · 42 (Alibaba, China) Gemini 3.1 Flash Lite Preview · 41.9 (Google, United States) Gemma 4 31B IT · 41.1 (Google, United States) OpenAI o3 · 41 (OpenAI, United States) Claude Opus 4 · 40.9 (Anthropic, United States) Claude Opus 4.1 (Thinking 16K) · 40.9 (Anthropic, United States) Qwen3.6-Plus · 40.7 (Alibaba, China) Grok 3 
Mini High · 40.6 (xAI, United States) GPT-4.1 Mini · 40.4 (OpenAI, United States) GPT-5.2 Chat Latest · 40.4 (OpenAI, United States) Claude 3.7 Sonnet (Thinking 32K) · 40.3 (Anthropic, United States) DeepSeek-R1 · 40.3 (DeepSeek, China) Claude Sonnet 4 (Thinking 32K) · 40 (Anthropic, United States) OpenAI o3-mini · 39.9 (OpenAI, United States) Claude Opus 4 (Thinking 16K) · 39.8 (Anthropic, United States) OpenAI o3-mini High · 39.8 (OpenAI, United States) Qwen3.6-27B · 39.8 (Alibaba, China) Qwen3.5-27B · 39.5 (Alibaba, China) GPT-5 Mini High · 39.2 (OpenAI, United States) DeepSeek-V3.2 · 38.7 (DeepSeek, China) Qwen3 Max (2025-09-23) · 38.3 (Alibaba, China) GPT-4.1 · 38.1 (OpenAI, United States) GPT-5 Chat · 37.8 (OpenAI, United States) Qwen3.5-35B-A3B · 37.7 (Alibaba, China) Gemini 2.5 Flash Preview 09-2025 · 37.5 (Google, United States) Claude Sonnet 4 · 37.3 (Anthropic, United States) Gemma 4 26B-A4B IT · 37.3 (Google, United States) Qwen3 Max Preview · 37 (Alibaba, China) Grok 3 Preview 02-24 · 36.8 (xAI, United States) DeepSeek-V3.1 · 36.7 (DeepSeek, China) ChatGPT-4o Latest (2025-03-26) · 36.6 (OpenAI, United States) GPT-5 Nano High · 36.6 (OpenAI, United States) Nvidia Nemotron 3 Super · 36 (NVIDIA, United States) OpenAI o1 · 35.8 (OpenAI, United States) Qwen3.6 35B-A3B · 35.8 (Alibaba, China) DeepSeek-V3 · 35.4 (DeepSeek, China) GLM-4.5 · 34.8 (Z.ai, China) Kimi K2 Instruct · 34.5 (Moonshot AI, China) Claude Haiku 4.5 · 34.4 (Anthropic, United States) Qwen2.5 Max · 33.7 (Alibaba, China) Gemini 2.0 Flash · 33.3 (Google, United States) GPT-4o · 33.3 (OpenAI, United States) GLM-4.6 · 33.1 (Z.ai, China) Llama 4 Maverick · 33.1 (Meta, United States) Grok 4 Fast Chat · 32.9 (xAI, United States) OpenAI o1-mini · 32.3 (OpenAI, United States) Kimi K2 Instruct 0905 · 30.7 (Moonshot AI, China) Llama 3.1 405B Instruct · 29.9 (Meta, United States) Qwen3-235B-A22B · 29.9 (Alibaba, China) Gemini 2.5 Flash · 29.1 (Google, United States) Gemini 2.5 Flash Lite Preview 09-2025 
(No Thinking) · 28.5 (Google, United States) Grok 2 · 28.5 (xAI, United States) Amazon Nova 2 Pro · 28.1 (Amazon, United States) Qwen3-32B · 28 (Alibaba, China) Nemotron 3 Nano Omni · 27.8 (NVIDIA, United States) Qwen3.5-9B · 27.5 (Alibaba, China) Llama 3.1 70B Instruct · 26.7 (Meta, United States) Qwen3-30B-A3B · 26.4 (Alibaba, China) Llama 3.3 70B Instruct · 26 (Meta, United States) Gemini 2.0 Flash Lite Preview · 24.7 (Google, United States) Amazon Nova 2 Lite · 24 (Amazon, United States) Gemma 3 27B IT · 21.2 (Google, United States) Gemma 4 E2B IT · 20.4 (Google, United States) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 19.3 (Google, United States) Llama 3 70B Instruct · 18.9 (Meta, United States) Llama 4 Scout · 17 (Meta, United States) Llama 3.1 8B Instruct · 13.2 (Meta, United States) Llama 3 8B Instruct · 11.9 (Meta, United States) Llama 2 13B Chat · 11.8 (Meta, United States) Gemma 3 4B IT · 7.3 (Google, United States) Gemma 4 E4B IT · 3.9 (Google, United States) Llama 2 7B Chat · 0 (Meta, United States)

0–70 Terminal Bench Hard
GPT-5.5 · 60.6 (OpenAI, United States) GPT-5.4 · 57.6 (OpenAI, United States) GPT-5.4 High · 57.6 (OpenAI, United States) Gemini 3.1 Pro Preview · 53.8 (Google, United States) GPT-5.4 Mini High · 52.3 (OpenAI, United States) Claude Opus 4.7 · 51.5 (Anthropic, United States) Claude Opus 4.7 Thinking · 51.5 (Anthropic, United States) Qwen 3.7 Max · 50.8 (Alibaba, China) Claude Opus 4.6 · 48.5 (Anthropic, United States) Claude Opus 4.5 (Thinking 32K) · 47.0 (Anthropic, United States) GPT-5.2 · 47.0 (OpenAI, United States) GPT-5.2 High · 47.0 (OpenAI, United States) Claude Opus 4.6 (Thinking) · 46.2 (Anthropic, United States) Claude Sonnet 4.6 · 46.2 (Anthropic, United States) DeepSeek-V4-Pro · 46.2 (DeepSeek, China) GPT-5.1 · 45.5 (OpenAI, United States) GPT-5.1 High · 45.5 (OpenAI, United States) Kimi K2.6 · 43.9 (Moonshot AI, China) Qwen3.6 Max Preview · 43.9 (Alibaba, China) Qwen3.6-Plus · 43.9 (Alibaba, China) GLM-5 · 43.2 
(Z.ai, China) GLM-5.1 · 43.2 (Z.ai, China) MiMo-V2.5-Pro · 43.2 (Xiaomi, China) GPT-5.4 Nano High · 42.4 (OpenAI, United States) Gemini 3 Pro · 41.7 (Google, United States) MiMo-V2.5 · 41.7 (Xiaomi, China) Claude Opus 4.5 · 40.9 (Anthropic, United States) Gemini 3.5 Flash · 40.9 (Google, United States) Grok 4.20 Beta 0309 Reasoning · 40.9 (xAI, United States) MiMo v2 Pro · 40.9 (Xiaomi, China) Qwen3.5-397B-A17B · 40.9 (Alibaba, China) MiniMax M2.7 · 39.4 (MiniMax, China) Gemini 3 Flash (Thinking Minimal) · 38.6 (Google, United States) Grok 4 (0709) · 37.9 (xAI, United States) Grok 4.3 · 37.9 (xAI, United States) Grok 4.3 beta · 37.9 (xAI, United States) OpenAI o3 · 37.1 (OpenAI, United States) Claude Sonnet 4.5 (Thinking 32K) · 35.6 (Anthropic, United States) DeepSeek-V4-Flash · 35.6 (DeepSeek, China) Kimi K2.5 · 34.9 (Moonshot AI, China) minimax-m2.5 · 34.9 (MiniMax, China) Qwen3.6 35B-A3B · 34.9 (Alibaba, China) Qwen3.6-27B · 34.9 (Alibaba, China) Claude Opus 4.1 (Thinking 16K) · 34.3 (Anthropic, United States) GPT-5 Mini High · 33.3 (OpenAI, United States) DeepSeek-V3.2 · 32.6 (DeepSeek, China) GPT-5 High · 32.6 (OpenAI, United States) Qwen3.5-27B · 32.6 (Alibaba, China) Gemini 3 Flash · 31.8 (Google, United States) GLM-4.7 · 31.8 (Z.ai, China) GPT-5.2 Chat Latest · 31.8 (OpenAI, United States) Claude Opus 4 (Thinking 16K) · 31.1 (Anthropic, United States) Claude Sonnet 4 (Thinking 32K) · 31.1 (Anthropic, United States) Kimi K2 Thinking · 31.1 (Moonshot AI, China) Qwen3.5-122B-A10B · 31.1 (Alibaba, China) Gemma 4 31B IT · 30.3 (Google, United States) Claude Sonnet 4.5 · 28.8 (Anthropic, United States) GLM-4.6 · 28.8 (Z.ai, China) Nvidia Nemotron 3 Super · 28.8 (NVIDIA, United States) Claude Haiku 4.5 · 27.3 (Anthropic, United States) Claude Sonnet 4 · 27.3 (Anthropic, United States) Gemini 2.5 Pro · 26.5 (Google, United States) Qwen3.5-35B-A3B · 26.5 (Alibaba, China) Gemma 4 26B-A4B IT · 25 (Google, United States) DeepSeek-V3.1 · 24.2 (DeepSeek, China) Gemini 
3.1 Flash Lite Preview · 24.2 (Google, United States) Grok 4.1 Fast Reasoning · 24.2 (xAI, United States) Qwen3.5-9B · 24.2 (Alibaba, China) Kimi K2 Instruct 0905 · 23.5 (Moonshot AI, China) GLM-4.5 · 22.0 (Z.ai, China) Claude 3.7 Sonnet (Thinking 32K) · 21.2 (Anthropic, United States) Qwen3 Max (2025-09-23) · 20.4 (Alibaba, China) Qwen3 Max Preview · 19.7 (Alibaba, China) Grok 4 Fast Reasoning · 18.9 (xAI, United States) Grok 3 Mini High · 17.4 (xAI, United States) Amazon Nova 2 Pro · 16.7 (Amazon, United States) DeepSeek-R1 · 15.9 (DeepSeek, China) Kimi K2 Instruct · 15.9 (Moonshot AI, China) OpenAI o4-mini · 15.2 (OpenAI, United States) Gemini 2.5 Flash Preview 09-2025 · 14.4 (Google, United States) GPT-4.1 · 13.6 (OpenAI, United States) GPT-5 Chat · 12.9 (OpenAI, United States) OpenAI o1 · 12.9 (OpenAI, United States) Gemini 2.5 Flash · 12.1 (Google, United States) GPT-5 Nano High · 12.1 (OpenAI, United States) Grok 4 Fast Chat · 12.1 (xAI, United States) Grok 3 Preview 02-24 · 11.4 (xAI, United States) GPT-4o · 8.3 (OpenAI, United States) Nemotron 3 Nano Omni · 8.3 (NVIDIA, United States) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 7.6 (Google, United States) Gemma 4 E4B IT · 7.6 (Google, United States) GPT-4.1 Mini · 7.6 (OpenAI, United States) Amazon Nova 2 Lite · 6.8 (Amazon, United States) DeepSeek-V3 · 6.8 (DeepSeek, China) Llama 3.1 405B Instruct · 6.8 (Meta, United States) Llama 4 Maverick · 6.8 (Meta, United States) OpenAI o3-mini · 6.8 (OpenAI, United States) Qwen3-30B-A3B · 6.8 (Alibaba, China) OpenAI o3-mini High · 6.1 (OpenAI, United States) Qwen3-235B-A22B · 6.1 (Alibaba, China) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 4.5 (Google, United States) Gemini 2.0 Flash · 3.8 (Google, United States) Gemma 3 27B IT · 3.8 (Google, United States) Llama 3.1 70B Instruct · 3.0 (Meta, United States) Llama 3.3 70B Instruct · 3.0 (Meta, United States) Gemma 4 E2B IT · 2.3 (Google, United States) Llama 4 Scout · 1.5 (Meta, United States) Gemma 3 
4B IT · 0.8 (Google, United States) Llama 3 70B Instruct · 0.8 (Meta, United States) Llama 3.1 8B Instruct · 0.8 (Meta, United States) Llama 3 8B Instruct · 0 (Meta, United States)

10–90 IFBench
Grok 4.20 Beta 0309 Reasoning · 82.9 (xAI, United States) Grok 4.3 · 81.3 (xAI, United States) Grok 4.3 beta · 81.3 (xAI, United States) Qwen 3.7 Max · 80.5 (Alibaba, China) MiMo-V2.5-Pro · 79.9 (Xiaomi, China) DeepSeek-V4-Flash · 79.2 (DeepSeek, China) Qwen3.5-397B-A17B · 78.8 (Alibaba, China) Gemini 3 Flash (Thinking Minimal) · 78.0 (Google, United States) Gemini 3.1 Flash Lite Preview · 77.2 (Google, United States) Gemini 3.1 Pro Preview · 77.1 (Google, United States) Qwen3.6 Max Preview · 76.6 (Alibaba, China) DeepSeek-V4-Pro · 76.5 (DeepSeek, China) Gemini 3.5 Flash · 76.3 (Google, United States) GLM-5.1 · 76.3 (Z.ai, China) Kimi K2.6 · 76.0 (Moonshot AI, China) GPT-5.4 Nano High · 75.9 (OpenAI, United States) GPT-5.5 · 75.8 (OpenAI, United States) MiniMax M2.7 · 75.7 (MiniMax, China) Qwen3.5-122B-A10B · 75.7 (Alibaba, China) Qwen3.5-27B · 75.6 (Alibaba, China) GPT-5 Mini High · 75.4 (OpenAI, United States) GPT-5.2 · 75.4 (OpenAI, United States) GPT-5.2 High · 75.4 (OpenAI, United States) Qwen3.6-Plus · 75.2 (Alibaba, China) GPT-5.4 · 74.0 (OpenAI, United States) GPT-5.4 High · 74.0 (OpenAI, United States) GPT-5.4 Mini High · 73.3 (OpenAI, United States) GPT-5 High · 73.1 (OpenAI, United States) GPT-5.1 · 72.9 (OpenAI, United States) GPT-5.1 High · 72.9 (OpenAI, United States) Qwen3.5-35B-A3B · 72.5 (Alibaba, China) GLM-5 · 72.3 (Z.ai, China) minimax-m2.5 · 71.6 (MiniMax, China) Nvidia Nemotron 3 Super · 71.5 (NVIDIA, United States) OpenAI o3 · 71.4 (OpenAI, United States) Gemini 3 Pro · 70.4 (Google, United States) OpenAI o1 · 70.3 (OpenAI, United States) Kimi K2.5 · 70.2 (Moonshot AI, China) MiMo v2 Pro · 68.8 (Xiaomi, China) OpenAI o4-mini · 68.7 (OpenAI, United States) Kimi K2 Thinking · 68.1 (Moonshot AI, China) GLM-4.7 · 67.9 (Z.ai, China) GPT-5 Nano High · 67.5 
(OpenAI, United States) Qwen3.6-27B · 67.5 (Alibaba, China) MiMo-V2.5 · 67.1 (Xiaomi, China) OpenAI o3-mini High · 67.1 (OpenAI, United States) Qwen3.5-9B · 66.7 (Alibaba, China) Qwen3.6 35B-A3B · 64.3 (Alibaba, China) Nemotron 3 Nano Omni · 63.2 (NVIDIA, United States) Claude Opus 4.7 · 58.6 (Anthropic, United States) Claude Opus 4.7 Thinking · 58.6 (Anthropic, United States) Claude Opus 4.5 (Thinking 32K) · 58.0 (Anthropic, United States) Claude Sonnet 4.5 (Thinking 32K) · 57.3 (Anthropic, United States) Claude Opus 4.1 (Thinking 16K) · 55.4 (Anthropic, United States) Gemini 3 Flash · 55.1 (Google, United States) Claude Sonnet 4 (Thinking 32K) · 54.7 (Anthropic, United States) Claude Opus 4 (Thinking 16K) · 53.7 (Anthropic, United States) Grok 4 (0709) · 53.7 (xAI, United States) Gemma 4 31B IT · 53.5 (Google, United States) Claude Opus 4.6 (Thinking) · 53.1 (Anthropic, United States) Grok 4.1 Fast Reasoning · 52.7 (xAI, United States) Amazon Nova 2 Pro · 52.0 (Amazon, United States) Grok 4 Fast Reasoning · 50.5 (xAI, United States) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 49.9 (Google, United States) DeepSeek-V3.2 · 49.0 (DeepSeek, China) Gemini 2.5 Pro · 48.7 (Google, United States) Claude 3.7 Sonnet (Thinking 32K) · 48.3 (Anthropic, United States) Qwen3 Max Preview · 48.0 (Alibaba, China) GPT-5.2 Chat Latest · 47.4 (OpenAI, United States) Llama 3.3 70B Instruct · 47.1 (Meta, United States) Grok 3 Preview 02-24 · 46.9 (xAI, United States) Grok 3 Mini High · 45.9 (xAI, United States) Gemma 4 26B-A4B IT · 45.4 (Google, United States) Claude Sonnet 4 · 45.4 (Anthropic, United States) GPT-5 Chat · 45.0 (OpenAI, United States) Claude Opus 4.6 · 44.6 (Anthropic, United States) Qwen3 Max (2025-09-23) · 44.1 (Alibaba, China) GLM-4.5 · 44.1 (Z.ai, China) Gemini 2.5 Flash Preview 09-2025 · 43.5 (Google, United States) Claude Opus 4 · 43.3 (Anthropic, United States) Claude Opus 4.5 · 43.0 (Anthropic, United States) GPT-4.1 · 43.0 (OpenAI, United States) Llama 4 
Maverick · 43.0 (Meta, United States) Claude Sonnet 4.5 · 42.6 (Anthropic, United States) Claude Haiku 4.5 · 42.0 (Anthropic, United States) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 41.8 (Google, United States) Kimi K2 Instruct 0905 · 41.7 (Moonshot AI, China) Kimi K2 Instruct · 41.5 (Moonshot AI, China) Claude Sonnet 4.6 · 41.2 (Anthropic, United States) Amazon Nova 2 Lite · 40.5 (Amazon, United States) Gemma 4 E4B IT · 40.5 (Google, United States) Gemini 2.0 Flash · 40.2 (Google, United States) DeepSeek-R1 · 39.6 (DeepSeek, China) Llama 4 Scout · 39.5 (Meta, United States) Llama 3.1 405B Instruct · 39.0 (Meta, United States) Gemini 2.5 Flash · 39.0 (Google, United States) GPT-4.1 Mini · 38.3 (OpenAI, United States) DeepSeek-V3.1 · 37.8 (DeepSeek, China) Grok 4 Fast Chat · 37.7 (xAI, United States) Llama 3 70B Instruct · 37.1 (Meta, United States) GLM-4.6 · 36.7 (Z.ai, China) Qwen3-235B-A22B · 36.6 (Alibaba, China) DeepSeek-V3 · 34.8 (DeepSeek, China) Llama 3.1 70B Instruct · 34.4 (Meta, United States) GPT-4o · 34.3 (OpenAI, United States) Gemma 4 E2B IT · 33.6 (Google, United States) Qwen3-30B-A3B · 31.9 (Alibaba, China) Gemma 3 27B IT · 31.8 (Google, United States) Qwen3-32B · 31.5 (Alibaba, China) Llama 3.1 8B Instruct · 28.6 (Meta, United States) Gemma 3 4B IT · 28.3 (Google, United States) Llama 3 8B Instruct · 24.6 (Meta, United States)

0–100 AA LCR
GPT-5 High · 75.6 (OpenAI, United States) GPT-5.1 · 75 (OpenAI, United States) GPT-5.1 High · 75 (OpenAI, United States) GPT-5.5 · 74.3 (OpenAI, United States) Claude Opus 4.5 (Thinking 32K) · 74 (Anthropic, United States) GPT-5.4 · 74 (OpenAI, United States) GPT-5.4 High · 74 (OpenAI, United States) MiMo-V2.5-Pro · 73.3 (Xiaomi, China) Gemini 3.1 Pro Preview · 72.7 (Google, United States) GPT-5.2 · 72.7 (OpenAI, United States) GPT-5.2 High · 72.7 (OpenAI, United States) Claude Opus 4.6 (Thinking) · 70.7 (Anthropic, United States) Gemini 3 Pro · 70.7 (Google, United States) Claude Opus 4.7 · 70.3 
(Anthropic, United States) Claude Opus 4.7 Thinking · 70.3 (Anthropic, United States) Kimi K2.6 · 69.7 (Moonshot AI, China) Qwen3.6 Max Preview · 69.7 (Alibaba, China) Qwen3.6-Plus · 69.7 (Alibaba, China) Gemini 3.5 Flash · 69.3 (Google, United States) GPT-5.4 Mini High · 69.3 (OpenAI, United States) OpenAI o3 · 69.3 (OpenAI, United States) Qwen 3.7 Max · 69 (Alibaba, China) MiniMax M2.7 · 68.7 (MiniMax, China) Qwen3.6-27B · 68.7 (Alibaba, China) GPT-5 Mini High · 68 (OpenAI, United States) Grok 4 (0709) · 68 (xAI, United States) Grok 4.1 Fast Reasoning · 68 (xAI, United States) Qwen3.5-27B · 67.3 (Alibaba, China) Qwen3.5-122B-A10B · 66.7 (Alibaba, China) Claude Opus 4.1 (Thinking 16K) · 66.3 (Anthropic, United States) DeepSeek-V4-Pro · 66.3 (DeepSeek, China) Gemini 3 Flash (Thinking Minimal) · 66.3 (Google, United States) Kimi K2 Thinking · 66.3 (Moonshot AI, China) Gemini 2.5 Pro · 66 (Google, United States) GPT-5.4 Nano High · 66 (OpenAI, United States) minimax-m2.5 · 66 (MiniMax, China) Claude Sonnet 4.5 (Thinking 32K) · 65.7 (Anthropic, United States) Qwen3.5-397B-A17B · 65.7 (Alibaba, China) Claude Opus 4.5 · 65.3 (Anthropic, United States) Gemini 3.1 Flash Lite Preview · 65.3 (Google, United States) Kimi K2.5 · 65.3 (Moonshot AI, China) Claude Sonnet 4 (Thinking 32K) · 64.7 (Anthropic, United States) Grok 4 Fast Reasoning · 64.7 (xAI, United States) Grok 4.3 · 64.3 (xAI, United States) Grok 4.3 beta · 64.3 (xAI, United States) GLM-4.7 · 64 (Z.ai, China) GPT-5 Chat · 63.7 (OpenAI, United States) Qwen3.6 35B-A3B · 63.7 (Alibaba, China) GLM-5 · 63.3 (Z.ai, China) DeepSeek-V4-Flash · 63 (DeepSeek, China) MiMo-V2.5 · 62.7 (Xiaomi, China) Qwen3.5-35B-A3B · 62.7 (Alibaba, China) GLM-5.1 · 62.3 (Z.ai, China) GPT-4.1 · 61 (OpenAI, United States) Claude 3.7 Sonnet (Thinking 32K) · 60.7 (Anthropic, United States) MiMo v2 Pro · 60.7 (Xiaomi, China) Nvidia Nemotron 3 Super · 60 (NVIDIA, United States) OpenAI o1 · 59.3 (OpenAI, United States) Grok 4.20 Beta 0309 Reasoning 
· 59 (xAI, United States) Qwen3.5-9B · 59 (Alibaba, China) Claude Opus 4.6 · 58.3 (Anthropic, United States) Claude Sonnet 4.6 · 57.7 (Anthropic, United States) Gemini 2.5 Flash Preview 09-2025 · 56.7 (Google, United States) OpenAI o4-mini · 55 (OpenAI, United States) DeepSeek-R1 · 54.7 (DeepSeek, China) Grok 3 Preview 02-24 · 54.7 (xAI, United States) Kimi K2 Instruct 0905 · 52.3 (Moonshot AI, China) Claude Sonnet 4.5 · 51.3 (Anthropic, United States) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 51.3 (Google, United States) Kimi K2 Instruct · 51 (Moonshot AI, China) Grok 3 Mini High · 50.3 (xAI, United States) GLM-4.5 · 48.3 (Z.ai, China) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 48 (Google, United States) Gemini 3 Flash · 48 (Google, United States) Qwen3 Max (2025-09-23) · 46.7 (Alibaba, China) Llama 4 Maverick · 46 (Meta, United States) Gemini 2.5 Flash · 45.9 (Google, United States) DeepSeek-V3.1 · 45 (DeepSeek, China) Claude Sonnet 4 · 44.3 (Anthropic, United States) Claude Haiku 4.5 · 43.7 (Anthropic, United States) GPT-4.1 Mini · 42.3 (OpenAI, United States) GPT-5 Nano High · 41.7 (OpenAI, United States) Gemma 4 26B-A4B IT · 39.7 (Google, United States) Qwen3 Max Preview · 39.7 (Alibaba, China) OpenAI o3-mini High · 39.3 (OpenAI, United States) DeepSeek-V3.2 · 39 (DeepSeek, China) GPT-5.2 Chat Latest · 38 (OpenAI, United States) Claude Opus 4 · 36 (Anthropic, United States) Gemma 4 31B IT · 36 (Google, United States) Nemotron 3 Nano Omni · 35.7 (NVIDIA, United States) Claude Opus 4 (Thinking 16K) · 33.7 (Anthropic, United States) DeepSeek-V3 · 29 (DeepSeek, China) Amazon Nova 2 Pro · 28.3 (Amazon, United States) Gemini 2.0 Flash · 28.3 (Google, United States) GLM-4.6 · 26.3 (Z.ai, China) Llama 4 Scout · 25.8 (Meta, United States) Llama 3.1 405B Instruct · 24.3 (Meta, United States) Grok 4 Fast Chat · 20 (xAI, United States) Gemma 4 E4B IT · 18 (Google, United States) Amazon Nova 2 Lite · 17.7 (Amazon, United States) Llama 3.1 8B Instruct · 
15.7 (Meta, United States) Gemma 4 E2B IT · 15 (Google, United States) Llama 3.3 70B Instruct · 15 (Meta, United States) Llama 3.1 70B Instruct · 6.3 (Meta, United States) Gemma 3 27B IT · 5.7 (Google, United States) Gemma 3 4B IT · 5.7 (Google, United States) GPT-4o · 0 (OpenAI, United States) Llama 3 70B Instruct · 0 (Meta, United States) Llama 3 8B Instruct · 0 (Meta, United States) Qwen3-235B-A22B · 0 (Alibaba, China) Qwen3-30B-A3B · 0 (Alibaba, China) Qwen3-32B · 0 (Alibaba, China)

20–100 Arena Score
Claude Opus 4.6 (Thinking) · 100 (Anthropic, United States) Claude Opus 4.6 · 99.7 (Anthropic, United States) Claude Opus 4.7 Thinking · 98.0 (Anthropic, United States) Gemini 3.1 Pro Preview · 97.3 (Google, United States) Claude Opus 4.7 · 97.2 (Anthropic, United States) Gemini 3 Pro · 97.0 (Google, United States) Meta Muse Spark · 96.2 (Meta, United States) GPT-5.4 High · 96.0 (OpenAI, United States) Qwen3.5 Max Preview · 95.7 (Alibaba, China) GLM-5.1 · 95.5 (Z.ai, China) Gemini 3 Flash · 95.0 (Google, United States) GPT-5.5 · 94.7 (OpenAI, United States) Gemini 2.5 Pro · 93.7 (Google, United States) GPT-5.4 · 93.5 (OpenAI, United States) Kimi K2.6 · 93.5 (Moonshot AI, China) Claude Sonnet 4.6 · 93.3 (Anthropic, United States) Grok 4.20 Beta 0309 Reasoning · 93.2 (xAI, United States) Grok 4.20 Multi-Agent Beta 0309 · 92.7 (xAI, United States) Claude Opus 4.5 · 92.5 (Anthropic, United States) Dola Seed 2.0 Pro · 92.5 (ByteDance, China) Claude Opus 4.5 (Thinking 32K) · 92.0 (Anthropic, United States) Gemini 3 Flash (Thinking Minimal) · 92.0 (Google, United States) ERNIE 5.0 0110 · 92.0 (Baidu, China) DeepSeek-V4-Pro · 92.0 (DeepSeek, China) Grok 4.20 Beta1 · 92.0 (xAI, United States) GLM-5 · 91.9 (Z.ai, China) Kimi K2.5 · 91.8 (Moonshot AI, China) Qwen3.6 Max Preview · 91.8 (Alibaba, China) Gemma 4 31B IT · 91.5 (Google, United States) ERNIE 5.0 Preview 1203 · 91.4 (Baidu, China) GPT-5.1 High · 91.3 (OpenAI, United States) Qwen3.5-397B-A17B · 91.2 (Alibaba, China) 
GLM-4.6 · 91.2 (Z.ai, China) GPT-5.2 Chat Latest · 91.0 (OpenAI, United States) Qwen3 Max Preview · 90.9 (Alibaba, China) Grok 4.1 (Thinking) · 90.8 (xAI, United States) Claude Sonnet 4.5 · 90.7 (Anthropic, United States) Grok 4.1 · 90.7 (xAI, United States) GLM-4.7 · 90.5 (Z.ai, China) MiMo v2 Pro · 90.4 (Xiaomi, China) Gemma 4 26B-A4B IT · 90.3 (Google, United States) Claude Sonnet 4.5 (Thinking 32K) · 89.9 (Anthropic, United States) ERNIE 5.0 Preview 1022 · 89.6 (Baidu, China) GLM-4.5 · 89.4 (Z.ai, China) ChatGPT-4o Latest (2025-03-26) · 89.4 (OpenAI, United States) DeepSeek-V4-Flash · 89.3 (DeepSeek, China) DeepSeek-R1 · 89.3 (DeepSeek, China) Grok 3 Preview 02-24 · 88.8 (xAI, United States) DeepSeek-V3.2 · 88.6 (DeepSeek, China) GPT-5.1 · 88.4 (OpenAI, United States) DeepSeek-V3.1 · 88.0 (DeepSeek, China) Claude Opus 4.1 (Thinking 16K) · 87.8 (Anthropic, United States) Gemini 2.5 Flash · 87.7 (Google, United States) Qwen3.5-122B-A10B · 87.7 (Alibaba, China) Claude Opus 4.1 · 87.7 (Anthropic, United States) GPT-4.5 Preview · 87.7 (OpenAI, United States) GPT-5.2 High · 87.6 (OpenAI, United States) Gemini 3.1 Flash Lite Preview · 87.6 (Google, United States) GPT-5.4 Mini High · 87.4 (OpenAI, United States) Kimi K2 Thinking · 87.2 (Moonshot AI, China) Qwen3 Max (2025-09-23) · 87.0 (Alibaba, China) GPT-5.2 · 86.8 (OpenAI, United States) Grok 4 (0709) · 86.6 (xAI, United States) OpenAI o3 · 86.5 (OpenAI, United States) Grok 4 Fast Chat · 86.5 (xAI, United States) Grok 4.1 Fast Reasoning · 86.3 (xAI, United States) Qwen3.5-27B · 86.3 (Alibaba, China) Gemini 2.5 Flash Preview 09-2025 · 86.1 (Google, United States) Hunyuan Vision 1.5 Thinking · 86 (Tencent, China) GPT-5 High · 85.9 (OpenAI, United States) GPT-5 Chat · 85.6 (OpenAI, United States) MiniMax M2.7 · 85.1 (MiniMax, China) Hunyuan T1 · 85.0 (Tencent, China) Grok 4 Fast Reasoning · 84.7 (xAI, United States) Qwen3.5 Flash · 84.7 (Alibaba, China) Qwen3.5-35B-A3B · 84.5 (Alibaba, China) Qwen3-235B-A22B · 84.1 
(Alibaba, China) Claude Haiku 4.5 · 83.8 (Anthropic, United States) GPT-5.3 Chat Latest · 83.2 (OpenAI, United States) GPT-4.1 · 82.4 (OpenAI, United States) Kimi K2 Instruct 0905 · 82.0 (Moonshot AI, China) Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 81.9 (Google, United States) Nvidia Nemotron 3 Super · 81.8 (NVIDIA, United States) Hunyuan TurboS (2025-04-16) · 81.5 (Tencent, China) Claude Opus 4 (Thinking 16K) · 81.4 (Anthropic, United States) DeepSeek-V3 · 81.3 (DeepSeek, China) GPT-5 Mini High · 81.1 (OpenAI, United States) GPT-5.4 Nano High · 81.1 (OpenAI, United States) Kimi K2 Instruct · 80.7 (Moonshot AI, China) Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 80.3 (Google, United States) Grok 3 Mini High · 80.1 (xAI, United States) Qwen2.5 Max · 80.1 (Alibaba, China) OpenAI o1 · 80.0 (OpenAI, United States) Claude Opus 4 · 79.7 (Anthropic, United States) Amazon Nova 2 Lite · 79.5 (Amazon, United States) Grok 3 Mini Beta · 79.5 (xAI, United States) minimax-m2.5 · 79.1 (MiniMax, China) Gemma 3 27B IT · 78.8 (Google, United States) Gemini 2.0 Flash · 78.2 (Google, United States) OpenAI o1 Preview · 78 (OpenAI, United States) OpenAI o4-mini · 78.0 (OpenAI, United States) Claude Sonnet 4 (Thinking 32K) · 77.3 (Anthropic, United States) GPT-4.1 Mini · 76.1 (OpenAI, United States) Qwen3-32B · 76.1 (Alibaba, China) Claude Sonnet 4 · 75.7 (Anthropic, United States) OpenAI o3-mini High · 75.6 (OpenAI, United States) Step 1o Turbo (202506) · 75.3 (StepFun, China) GLM-4 Plus (0111) · 74.7 (Zhipu, China) Gemini 2.0 Flash Lite Preview · 74.5 (Google, United States) Qwen Plus (0125) · 74.0 (Alibaba, China) Step 2 16K Exp (202412) · 73.2 (StepFun, China) GPT-5 Nano High · 73.1 (OpenAI, United States) Hunyuan TurboS (2025-02-26) · 73.0 (Tencent, China) OpenAI o3-mini · 72.9 (OpenAI, United States) OpenAI o1-mini · 72.6 (OpenAI, United States) Qwen3-30B-A3B · 72.6 (Alibaba, China) Claude 3.7 Sonnet (Thinking 32K) · 72.2 (Anthropic, United States) Hunyuan Turbo 
(0110) · 71.8 (Tencent, China) Grok 2 · 70.8 (xAI, United States) Yi Lightning · 70.3 (01 AI, China) GPT-4o · 70.1 (OpenAI, United States) Gemma 3 4B IT · 68.7 (Google, United States) Llama 4 Maverick · 68.2 (Meta, United States) Llama 3.1 405B Instruct · 67.6 (Meta, United States) Llama 4 Scout · 67.2 (Meta, United States) Llama 3.3 70B Instruct · 66.3 (Meta, United States) Llama 3.1 70B Instruct · 64.2 (Meta, United States) Llama 3 70B Instruct · 58.2 (Meta, United States) Llama 3.1 8B Instruct · 53.0 (Meta, United States) Llama 3 8B Instruct · 49.9 (Meta, United States) Llama 2 70B Chat · 42.3 (Meta, United States) Llama 2 13B Chat · 37.7 (Meta, United States) Llama 2 7B Chat · 33.0 (Meta, United States)