Methodology
Benchmarks Overview
We present benchmark results as reported by model authors and trusted evaluators. Each score is accompanied by its source on the corresponding model page.
- MMLU: Massive Multitask Language Understanding (knowledge and reasoning)
- HellaSwag: Commonsense reasoning and sentence completion
- HumanEval: Code generation, scored as pass@1 (see the estimator sketch after this list)
- GSM8K: Grade-school math word problems
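For reference, pass@1 for HumanEval is conventionally computed with the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021). The Python sketch below shows that calculation; the function and variable names are chosen for illustration, not taken from any particular evaluation harness.

```python
from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = completions sampled per problem, c = completions that pass the unit
    # tests, k = evaluation budget. Returns the unbiased estimate of
    # pass@k = 1 - C(n - c, k) / C(n, k); with k = 1 this reduces to c / n.
    if n - c < k:
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 3 of 10 sampled completions pass -> pass@1 estimate of 0.3
print(pass_at_k(n=10, c=3, k=1))
```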
How We Use Benchmarks
Benchmarks are helpful signals, but not the whole story. We combine reported scores with real-world usage context, licensing, pricing, and context window size to help you choose the right model for your use case (a filtering sketch follows the list below).
- Scores are shown with sources whenever available
- Comparisons highlight per-benchmark strengths rather than a single composite rank
- Scores for open-source models can vary with quantization and inference setup
- We plan to incorporate community evals and standardized harnesses over time
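To make the selection criteria above concrete, here is a minimal, hypothetical Python sketch of how reported scores, pricing, licensing, and context window might be combined into a shortlist. The data model, thresholds, model names, and scores are all invented for illustration; they are not our production schema or real benchmark results.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ModelEntry:
    # Hypothetical record; field names are illustrative only.
    name: str
    license: str              # e.g. "apache-2.0" or "proprietary"
    context_window: int       # tokens
    price_per_mtok: float     # USD per million input tokens
    scores: Dict[str, float]  # benchmark name -> reported score (sources tracked separately)

def shortlist(models: List[ModelEntry],
              min_context: int = 32_000,
              max_price: float = 5.0,
              benchmarks: Tuple[str, ...] = ("MMLU", "HumanEval")) -> List[ModelEntry]:
    # Filter on practical constraints first, then rank by the benchmarks that
    # matter for the use case instead of collapsing everything into one number.
    eligible = [
        m for m in models
        if m.context_window >= min_context
        and m.price_per_mtok <= max_price
        and all(b in m.scores for b in benchmarks)
    ]
    return sorted(eligible,
                  key=lambda m: tuple(m.scores[b] for b in benchmarks),
                  reverse=True)

# Invented example data: model-b is dropped by the context and price filters.
models = [
    ModelEntry("model-a", "apache-2.0", 128_000, 0.5, {"MMLU": 0.78, "HumanEval": 0.71}),
    ModelEntry("model-b", "proprietary", 8_000, 10.0, {"MMLU": 0.86, "HumanEval": 0.90}),
]
print([m.name for m in shortlist(models)])  # ['model-a']
```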