Reasoning

BIG-Bench Hard Leaderboard

Challenging tasks from BIG-Bench that require advanced reasoning

Why This Matters

Measures a model's ability to solve problems that stumped earlier models, which indicates genuine multi-step reasoning capability

Good Scores

50%+ is competent, 65%+ is strong, 75%+ is exceptional
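
The thresholds above can be sketched as a small lookup. This is a minimal illustration, not part of the leaderboard itself: the function name `score_tier` and the label for scores under 50% are assumptions, since the page names no tier below "competent".

```python
def score_tier(score: float) -> str:
    """Map a BIG-Bench Hard score (in percent) to the tier names used on this page."""
    if score >= 75:
        return "exceptional"
    if score >= 65:
        return "strong"
    if score >= 50:
        return "competent"
    # Hypothetical label: the page does not name a tier under 50%.
    return "below competent"


print(score_tier(65.47))  # peak score on this leaderboard -> strong
print(score_tier(27.54))  # average score on this leaderboard -> below competent
```

By these thresholds the peak score of 65.47 just clears "strong", while the average and median both fall well under the "competent" line.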

Use Cases

  • Complex decision support
  • Strategic planning tools
  • Advanced problem solving
  • Logic-based applications

Peak Score

65.47

Average

27.54

Models Tested

53,438

Median Score

29.78

Performance by Model Size

How different size classes perform on this benchmark

📄 License Analysis

Performance by license type

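  • license:apache-2.0 (14,019 models): top score 65.47, average 29.88
  • Unknown (14,487 models): top score 64.05, average 26.95
  • llama (20,447 models): top score 63.47, average 26.67
  • license:mit (2,592 models): top score 62.16, average 26.03
  • license:gpl-3.0 (168 models): top score 52.46, average 15.39
  • license:cc-by-nc-4.0 (1,019 models): top score 46.00, average 30.37
  • llama-factory (70 models): top score 36.61, average 28.41
  • license:cc-by-nc-sa-4.0 (46 models): top score 36.50, average 34.78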

🔧 Framework Analysis

Performance by framework

  • OTHER (53,299 models): top score 65.47, average 27.58
  • PYTORCH (116 models): top score 29.38, average 13.50
  • HUGGINGFACE (23 models): top score 2.70, average 2.70
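
As a quick consistency check, the per-framework figures can be combined into a model-count-weighted mean and compared with the headline statistics (53,438 models tested, average 27.54). The sketch below assumes each group's average is a simple mean over the models in that group; the names `groups` and `weighted_average` are illustrative only.

```python
# (framework, model count, average score) as listed in the Framework Analysis
groups = [
    ("OTHER", 53299, 27.58),
    ("PYTORCH", 116, 13.50),
    ("HUGGINGFACE", 23, 2.70),
]


def weighted_average(rows):
    """Return (total model count, average weighted by model count)."""
    total_models = sum(count for _, count, _ in rows)
    weighted_sum = sum(count * avg for _, count, avg in rows)
    return total_models, weighted_sum / total_models


total, overall = weighted_average(groups)
print(total, round(overall, 2))  # 53438 27.54
```

Under that assumption the three framework groups account for every tested model and reproduce the reported overall average. The OTHER group holds almost all of the weight, so the much lower PYTORCH and HUGGINGFACE averages barely move the result.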

About BIG-Bench Hard

BIG-Bench Hard (BBH) is a suite of 23 challenging BIG-Bench tasks where prior language models did not outperform average human-rater performance. It tests complex reasoning, world knowledge, and multi-step problem solving.

Last updated: 11/21/2025

Complete Leaderboard

Top 50 models ranked by BIG-Bench Hard performance

  • #1 to #24: T3Q Qwen2.5 14b V1.0 E3 (JungZoona, license:apache-2.0), score 65.47
  • #25 to #47: shuttle-3 (shuttleai), score 64.05
  • #48 to #50: internlm2_5-20b-llamafied (IntervitensInc, llama), score 63.47