Benchmarks
Browse all benchmarks on BenchGecko
BenchGecko tracks 128 benchmark evaluation suites, documenting each suite's methodology and how model scores on it have evolved over time.
Benchmark Categories
| Category | Count | Key Benchmarks |
|---|---|---|
| Knowledge | 8 | MMLU, MMLU-Pro, TriviaQA |
| Coding | 14 | HumanEval, SWE-bench Verified, MBPP, LiveCodeBench, Aider, Codeforces |
| Mathematics | 8 | MATH, GSM8K, AIME 2024, AMC 2023 |
| Reasoning | 10 | ARC Challenge, HellaSwag, WinoGrande, BBH, MuSR, CLRS |
| Science | 4 | GPQA Diamond, GPQA, MedQA |
| Long Context | 6 | RULER, NIAH, LongBench v2, GraphWalks BFS 256K |
| Instruction Following | 5 | IFEval, AlpacaEval 2.0, MT-Bench |
| Safety & Factuality | 4 | TruthfulQA, SimpleQA |
| Human Preference | 3 | Chatbot Arena ELO, Arena Hard |
| Agentic | 6 | Tau-bench, WebArena, GAIA, OSWorld |
| Tool Use | 4 | BFCL, NATURAL, ToolBench |
| Multimodal | 5 | MMMU, MathVista, VideoMME |
| Domain | 8 | MedQA, LegalBench, FinanceBench |
| Other | 43 | Specialized and emerging benchmarks |
Benchmark Profiles
Each benchmark has a dedicated page showing:
- Description and methodology
- What it measures and why it matters
- Score distribution across all tested models
- Top-scoring models and how their scores have evolved over time
- Links to original papers and evaluation code
Score Normalization
All scores on BenchGecko are min-max normalized to a 0-100 scale. The average score displayed on model profiles is a weighted average across all available benchmarks, with harder benchmarks receiving higher weight.
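The exact difficulty weights are not published on this page, so the following is a minimal sketch of how such normalization and weighting could work. The function names (`min_max_normalize`, `weighted_average`) and the example benchmarks, bounds, and weights are hypothetical illustrations, not BenchGecko's actual implementation:

```python
def min_max_normalize(score: float, low: float, high: float) -> float:
    """Scale a raw benchmark score to 0-100, where `low` and `high` are the
    minimum and maximum raw scores observed across all tested models."""
    if high == low:
        return 0.0  # degenerate case: every model scored the same
    return 100.0 * (score - low) / (high - low)


def weighted_average(normalized: dict[str, float], weight: dict[str, float]) -> float:
    """Difficulty-weighted mean over the benchmarks a model has scores for.
    `weight` is assumed (hypothetically) to assign larger values to harder
    benchmarks; the real weighting scheme is defined in the methodology docs."""
    total = sum(weight[b] for b in normalized)
    return sum(normalized[b] * weight[b] for b in normalized) / total


# Hypothetical example: the harder benchmark carries twice the weight.
scores = {
    "MMLU-Pro": min_max_normalize(62.0, low=20.0, high=85.0),  # ~64.6
    "GSM8K": min_max_normalize(91.0, low=40.0, high=97.0),     # ~89.5
}
weights = {"MMLU-Pro": 2.0, "GSM8K": 1.0}
print(f"average score: {weighted_average(scores, weights):.1f}")  # ~72.9
```

In this sketch a benchmark on which all models score identically contributes 0 rather than dividing by zero; how BenchGecko actually handles such edge cases, and how it derives per-benchmark weights, is covered on the methodology page linked below.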
Detailed normalization methodology: benchgecko.ai/methodology
Related Pages
- Model Rankings -- models sorted by benchmark performance
- Compare Tool -- benchmark-by-benchmark model comparison
- Agents -- agent-specific evaluations
- Methodology -- how scores are collected and verified