Benchmarks

Browse all benchmarks on BenchGecko

BenchGecko tracks 128 benchmark suites, documenting each suite's methodology and how model scores have evolved over time.

Benchmark Categories

Category              Count  Key Benchmarks
Knowledge                 8  MMLU, MMLU-Pro, TriviaQA
Coding                   14  HumanEval, SWE-bench Verified, MBPP, LiveCodeBench, Aider, Codeforces
Mathematics               8  MATH, GSM8K, AIME 2024, AMC 2023
Reasoning                10  ARC Challenge, HellaSwag, WinoGrande, BBH, MuSR, CLRS
Science                   4  GPQA Diamond, GPQA, MedQA
Long Context              6  RULER, NIAH, LongBench v2, GraphWalks BFS 256K
Instruction Following     5  IFEval, AlpacaEval 2.0, MT-Bench
Safety & Factuality       4  TruthfulQA, SimpleQA
Human Preference          3  Chatbot Arena Elo, Arena Hard
Agentic                   6  Tau-bench, WebArena, GAIA, OSWorld
Tool Use                  4  BFCL, NATURAL, ToolBench
Multimodal                5  MMMU, MathVista, VideoMME
Domain                    8  MedQA, LegalBench, FinanceBench
Other                    43  Specialized and emerging benchmarks

Benchmark Profiles

Each benchmark has a dedicated page showing:

  • Description and methodology
  • What it measures and why it matters
  • Score distribution across all tested models
  • Top-scoring models and how their scores have evolved over time
  • Links to original papers and evaluation code

Score Normalization

All scores on BenchGecko are min-max normalized to a 0-100 scale. The average score displayed on model profiles is a weighted average across all available benchmarks, with harder benchmarks receiving higher weight.
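
As a rough illustration, the two steps (min-max normalization to 0-100, then a difficulty-weighted average) can be sketched in a few lines of Python. Everything below is an assumption made for clarity: the function names, the example numbers, and the weights are invented, since the page only states that harder benchmarks receive higher weight; the actual scheme is described at the methodology link below.

```python
# A minimal sketch of BenchGecko-style score aggregation, assuming:
#   - min-max bounds come from the scores observed across all tested models
#   - "harder benchmarks receive higher weight" is modeled with
#     hand-picked hypothetical weights (not BenchGecko's real scheme)

def min_max_normalize(score: float, lo: float, hi: float) -> float:
    """Map a raw score onto a 0-100 scale given observed min/max."""
    if hi == lo:  # degenerate case: every model scored the same
        return 0.0
    return 100.0 * (score - lo) / (hi - lo)

def weighted_average(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of normalized scores; weights need not sum to 1."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Illustrative raw scores for one model on three benchmarks,
# each with the (min, max) observed across all tested models.
raw = [(0.72, 0.25, 0.90),   # e.g. a knowledge benchmark
       (0.41, 0.30, 0.65),   # e.g. a hard science benchmark
       (0.58, 0.10, 0.80)]   # e.g. a coding benchmark
weights = [1.0, 2.5, 2.0]    # hypothetical difficulty weights

normalized = [min_max_normalize(s, lo, hi) for s, lo, hi in raw]
print([round(n, 1) for n in normalized])                # per-benchmark 0-100 scores
print(round(weighted_average(normalized, weights), 1))  # profile average
```

In this toy example the hard benchmark pulls the profile average well below the unweighted mean, which is the intended effect of difficulty weighting.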

Detailed normalization methodology: benchgecko.ai/methodology