BIG-Bench Hard Leaderboard
Challenging tasks from BIG-Bench that require advanced reasoning
Why This Matters
Measures the ability to solve problems that stumped earlier models, which makes it a useful indicator of multi-step reasoning capability
Good Scores
50%+ is competent, 65%+ is strong, 75%+ is exceptional
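The thresholds above can be read as a simple banding rule. The sketch below is illustrative only: the function name `score_tier` and the label "below competent" for scores under 50 are assumptions, not part of the leaderboard.

```python
def score_tier(score: float) -> str:
    """Map a BBH score (0-100) to the bands listed under Good Scores.

    The "below competent" label is a hypothetical name for scores under 50;
    the page itself does not name that band.
    """
    if score >= 75:
        return "exceptional"
    if score >= 65:
        return "strong"
    if score >= 50:
        return "competent"
    return "below competent"


# The peak score reported on this page falls in the "strong" band.
print(score_tier(65.47))
# The reported average falls below the "competent" threshold.
print(score_tier(27.54))
```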
Use Cases
- Complex decision support
- Strategic planning tools
- Advanced problem solving
- Logic-based applications
Peak Score: 65.47
Average: 27.54
Models Tested: 53,438
Median Score: 29.78
Efficiency Leaders
Best performance per billion parameters: the smart choices
ChatWaifu_v1.4
100.0M params • Score: 31.63
Efficiency: 316.31
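The efficiency figure appears to be the score divided by the parameter count in billions. The sketch below works under that assumption; the function name `efficiency` is hypothetical, and the page may compute the value from an unrounded score, which would explain a small difference from the listed 316.31.

```python
def efficiency(score: float, params: float) -> float:
    """Benchmark score per billion parameters (assumed definition).

    score  -- benchmark score on a 0-100 scale
    params -- total parameter count (e.g. 100.0e6 for 100.0M)
    """
    if params <= 0:
        raise ValueError("params must be positive")
    return score / (params / 1e9)


# ChatWaifu_v1.4 as listed: 100.0M params, score 31.63.
# The result is close to, but not exactly, the 316.31 shown on the page.
print(round(efficiency(31.63, 100.0e6), 1))
```

Because a ratio of this form rewards very small models heavily, it is best read alongside the raw score rather than on its own.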
Performance by Model Size
How different size classes perform on this benchmark
large
Avg Score: 38.59
🏆 Open Source Champions
Top permissively licensed models
📈 Most Downloaded Models
Popularity meets performance
📄 License Analysis
Performance by license type
🔧 Framework Analysis
Performance by framework
About BIG-Bench Hard
BIG-Bench Hard (BBH) is a suite of 23 challenging BIG-Bench tasks where prior language models did not outperform average human-rater performance. It tests complex reasoning, world knowledge, and multi-step problem solving.
Complete Leaderboard
Top 50 models ranked by BIG-Bench Hard performance
T3Q Qwen2.5 14b V1.0 E3
Score: 65.47
shuttle-3
Score: 64.05
internlm2_5-20b-llamafied
Score: 63.47