Benchmarks
Browse all benchmarks on BenchGecko
BenchGecko tracks 128 benchmark evaluation suites, documenting each suite's methodology and how model scores on it have evolved over time.
Benchmark Categories
| Category | Count | Key Benchmarks |
|---|---|---|
| Knowledge | 8 | MMLU, MMLU-Pro, TriviaQA |
| Coding | 14 | HumanEval, SWE-bench Verified, MBPP, LiveCodeBench, Aider, Codeforces |
| Mathematics | 8 | MATH, GSM8K, AIME 2024, AMC 2023 |
| Reasoning | 10 | ARC Challenge, HellaSwag, WinoGrande, BBH, MuSR, CLRS |
| Science | 4 | GPQA Diamond, GPQA, MedQA |
| Long Context | 6 | RULER, NIAH, LongBench v2, GraphWalks BFS 256K |
| Instruction Following | 5 | IFEval, AlpacaEval 2.0, MT-Bench |
| Safety & Factuality | 4 | TruthfulQA, SimpleQA |
| Human Preference | 3 | Chatbot Arena ELO, Arena Hard |
| Agentic | 6 | Tau-bench, WebArena, GAIA, OSWorld |
| Tool Use | 4 | BFCL, NATURAL, ToolBench |
| Multimodal | 5 | MMMU, MathVista, VideoMME |
| Domain | 8 | MedQA, LegalBench, FinanceBench |
| Other | 43 | Specialized and emerging benchmarks |
Benchmark Profiles
Each benchmark has a dedicated page showing:
- Description and methodology
- What it measures and why it matters
- Score distribution across all tested models
- Top-scoring models and how their scores have evolved over time
- Links to original papers and evaluation code
Score Normalization
All scores on BenchGecko are min-max normalized to a 0-100 scale. The average score displayed on model profiles is a weighted average across all available benchmarks, with harder benchmarks receiving higher weight.
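The exact difficulty weights are not published on this page, so the following is a minimal sketch of how such normalization and weighting could work. The function names (`min_max_normalize`, `weighted_average`) and the example benchmarks, bounds, and weights are hypothetical illustrations, not BenchGecko's actual implementation:

```python
def min_max_normalize(score: float, low: float, high: float) -> float:
    """Scale a raw benchmark score to 0-100, where `low` and `high` are the
    minimum and maximum raw scores observed across all tested models."""
    if high == low:
        return 0.0  # degenerate case: every model scored the same
    return 100.0 * (score - low) / (high - low)


def weighted_average(normalized: dict[str, float], weight: dict[str, float]) -> float:
    """Difficulty-weighted mean over the benchmarks a model has scores for.
    `weight` is assumed (hypothetically) to assign larger values to harder
    benchmarks; the real weighting scheme is defined in the methodology docs."""
    total = sum(weight[b] for b in normalized)
    return sum(normalized[b] * weight[b] for b in normalized) / total


# Hypothetical example: the harder benchmark carries twice the weight.
scores = {
    "MMLU-Pro": min_max_normalize(62.0, low=20.0, high=85.0),  # ~64.6
    "GSM8K": min_max_normalize(91.0, low=40.0, high=97.0),     # ~89.5
}
weights = {"MMLU-Pro": 2.0, "GSM8K": 1.0}
print(f"average score: {weighted_average(scores, weights):.1f}")  # ~72.9
```

In this sketch a benchmark on which all models score identically contributes 0 rather than dividing by zero; how BenchGecko actually handles such edge cases, and how it derives per-benchmark weights, is covered on the methodology page linked below.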
Detailed normalization methodology: benchgecko.ai/methodology
Related Pages
- Model Rankings -- models sorted by benchmark performance
- Compare Tool -- benchmark-by-benchmark model comparison
- Agents -- agent-specific evaluations
- Methodology -- how scores are collected and verified