LLMGauntlet
// MODEL LEADERBOARD

16 LLMs on Pearl's causal ladder

Evaluated on 3 causal DAGs (mediator, collider, confounder) across 2 regimes (static, interactive). Island_01 shows full scores; Island_02 and Island_03 are commercial worlds, so only their ranks and tier badges are shown.

16 models
| # | Model | Provider | n | P1 | P2 | P3 | Overall | Flag |
|---|---|---|---|---|---|---|---|---|
| 01 | deepseek-r1 | DeepSeek | 10 | 93.5 [86.5, 98.0] | 82.5 [75.5, 88.0] | 70.0 [61.0, 78.9] | 82.0 [79.3, 84.7] | |
| 02 | claude-opus-4-5 | Anthropic | 10 | 91.5 [85.5, 97.0] | 76.0 [68.5, 83.0] | 65.9 [62.1, 70.2] | 77.6 [75.3, 80.0] | |
| 03 | gemini-2.5-pro | Google | 7 | 72.2 [48.9, 90.0] | 79.4 [72.2, 86.1] | 68.7 [59.9, 73.4] | 75.8 [70.6, 80.8] | |
| 04 | deepseek-chat | DeepSeek | 10 | 85.5 [78.0, 91.0] | 65.5 [61.0, 71.5] | 72.1 [65.4, 78.5] | 73.5 [69.4, 77.7] | |
| 05 | claude-sonnet-4-5 | Anthropic | 10 | 89.5 [87.0, 92.5] | 65.0 [65.0, 65.0] | 64.7 [59.9, 69.5] | 72.2 [70.4, 74.1] | |
| 06 | o3-mini | OpenAI | 10 | 79.5 [64.0, 90.5] | 73.0 [61.0, 84.0] | 59.6 [54.7, 64.3] | 70.9 [63.4, 77.9] | |
| 07 | gpt-4o | OpenAI | 15 | 75.0 [69.0, 81.0] | 77.0 [69.7, 83.7] | 56.4 [54.1, 58.8] | 70.2 [66.6, 73.5] | |
| 08 | mistral-large | Mistral | 10 | 74.0 [62.5, 80.5] | 67.0 [60.0, 73.5] | 66.7 [57.3, 75.2] | 69.0 [64.7, 72.4] | |
| 09 | mistral-small-3 | Mistral | 10 | 71.0 [58.0, 82.0] | 71.5 [64.5, 80.0] | 63.7 [58.3, 69.2] | 69.0 [64.1, 73.8] | |
| 10 | command-r-plus | Cohere | 10 | 64.0 [53.0, 74.0] | 56.0 [55.0, 58.0] | 53.2 [48.6, 58.5] | 57.5 [54.9, 60.4] | |
| 11 | claude-haiku-3-5 | Anthropic | 10 | 66.0 [54.0, 75.5] | 65.0 [65.0, 65.0] | 35.4 [29.6, 42.1] | 56.4 [52.0, 60.2] | |
| 12 | qwen-72b | Alibaba | 10 | 47.5 [35.5, 59.0] | 65.0 [65.0, 65.0] | 52.4 [46.5, 57.6] | 56.0 [51.8, 60.1] | |
| 13 | llama-3.3-70b | Meta | 10 | 42.5 [35.0, 53.5] | 64.5 [59.0, 69.5] | 48.6 [36.1, 57.7] | 53.1 [46.1, 60.0] | |
| 14 | gpt-4o-mini | OpenAI | 10 | 33.0 [30.0, 35.0] | 70.0 [59.5, 80.5] | 44.8 [38.2, 53.0] | 51.3 [46.4, 56.7] | |
| 15 | llama-3.1-8b | Meta | 10 | 47.5 [40.0, 54.5] | 57.5 [55.0, 62.5] | 24.0 [19.5, 28.5] | 44.5 [41.8, 47.6] | |
| 16 | gemini-2.0-flash | Google | 10 | 30.0 [27.0, 33.0] | 65.0 [65.0, 65.0] | 27.9 [18.1, 37.6] | 43.4 [39.9, 46.7] | |

// CROSS-WORLD ANALYSIS

Consistency across 3 DAGs

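How consistent are LLMs across causal structures? Columns are named by island and regime (I01s = Island_01 static, I01i = Island_01 interactive, and so on). Range is the gap between a model's best and worst rank across the six columns; a lower range means a more consistent model. Only 1 of 16 models (claude-opus-4-5) stays in the top 3 across all 3 DAGs and both regimes.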

// MEAN ρ — STATIC
0.43
Kendall W 0.62 · full concordance
// MEAN ρ — STATIC (EXCL. ARTEFACT)
0.59
Kendall W 0.74 · excl. mode collapse
// MEAN ρ — INTERACTIVE
0.36
Kendall W 0.58 · confounded by F.M10
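The static and interactive concordance figures can be reproduced from the rank columns of the cross-world table. The sketch below is a minimal illustration, assuming the per-island ranks are untied and that mean ρ is the average pairwise Spearman correlation between the three islands; the function names and the row order (the table's own) are choices made here, not part of the leaderboard. The "excl. artefact" figure is not reproduced because the excluded models are not listed.

```python
from itertools import combinations

# Ranks transcribed from the cross-world table, one row per model in table
# order. Each tuple is (Island_01, Island_02, Island_03).
STATIC = [
    (2, 1, 3), (3, 3, 2), (1, 7, 6), (5, 2, 5),
    (6, 6, 4), (4, 4, 11), (8, 5, 15), (7, 12, 14),
    (9, 10, 8), (13, 13, 9), (10, 14, 12), (14, 15, 7),
    (11, 11, 13), (12, 8, 10), (15, 16, 1), (16, 9, 16),
]
INTERACTIVE = [
    (2, 1, 3), (3, 2, 11), (6, 5, 1), (1, 4, 14),
    (8, 3, 4), (5, 7, 7), (4, 6, 13), (9, 9, 2),
    (7, 8, 16), (10, 11, 6), (11, 12, 8), (12, 15, 5),
    (14, 13, 9), (16, 10, 15), (15, 14, 12), (13, 16, 10),
]


def kendall_w(ranks):
    """Kendall's coefficient of concordance for untied rankings."""
    n = len(ranks)         # number of ranked models
    m = len(ranks[0])      # number of rankings (islands)
    totals = [sum(row) for row in ranks]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def spearman_rho(a, b):
    """Spearman rank correlation between two untied rankings."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def mean_pairwise_rho(ranks):
    """Average Spearman rho over every pair of islands."""
    columns = list(zip(*ranks))
    pairs = list(combinations(columns, 2))
    return sum(spearman_rho(a, b) for a, b in pairs) / len(pairs)


for label, ranks in (("static", STATIC), ("interactive", INTERACTIVE)):
    print(label, round(kendall_w(ranks), 2), round(mean_pairwise_rho(ranks), 2))
```

For m untied rankings the two statistics are linked by mean ρ = (mW − 1)/(m − 1), which is why a W of 0.62 over three islands corresponds to a mean ρ of only 0.43.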
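Categories: Consistent · Moderate · Swinger

| # | Model | I01s | I01i | I02s | I02i | I03s | I03i | Range | Category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-5 ★ Consistent | 2 | 2 | 1 | 1 | 3 | 3 | 2 | consistent |
| 2 | gemini-2.5-pro | 3 | 3 | 3 | 2 | 2 | 11 | 9 | swinger |
| 3 | deepseek-r1 | 1 | 6 | 7 | 5 | 6 | 1 | 6 | swinger |
| 4 | claude-sonnet-4-5 | 5 | 1 | 2 | 4 | 5 | 14 | 13 | swinger |
| 5 | o3-mini | 6 | 8 | 6 | 3 | 4 | 4 | 5 | moderate |
| 6 | deepseek-chat | 4 | 5 | 4 | 7 | 11 | 7 | 7 | swinger |
| 7 | mistral-large | 8 | 4 | 5 | 6 | 15 | 13 | 11 | swinger |
| 8 | gpt-4o | 7 | 9 | 12 | 9 | 14 | 2 | 12 | swinger |
| 9 | mistral-small-3 | 9 | 7 | 10 | 8 | 8 | 16 | 9 | swinger |
| 10 | llama-3.3-70b | 13 | 10 | 13 | 11 | 9 | 6 | 7 | swinger |
| 11 | command-r-plus | 10 | 11 | 14 | 12 | 12 | 8 | 6 | swinger |
| 12 | gpt-4o-mini | 14 | 12 | 15 | 15 | 7 | 5 | 10 | swinger |
| 13 | claude-haiku-3-5 | 11 | 14 | 11 | 13 | 13 | 9 | 5 | moderate |
| 14 | qwen-72b | 12 | 16 | 8 | 10 | 10 | 15 | 8 | swinger |
| 15 | llama-3.1-8b | 15 | 15 | 16 | 14 | 1 | 12 | 15 | swinger |
| 16 | gemini-2.0-flash | 16 | 13 | 9 | 16 | 16 | 10 | 7 | swinger |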
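The Range column is the difference between each model's worst and best rank across the six columns. A minimal sketch of that calculation follows, along with the top-3 check. The category cutoffs (`consistent_max=2`, `moderate_max=5`) are assumptions chosen to agree with the rows shown; the leaderboard does not state its actual thresholds, and the function names are illustrative.

```python
# Ranks per model in column order (I01s, I01i, I02s, I02i, I03s, I03i),
# transcribed from the cross-world table.
RANKS = {
    "claude-opus-4-5": (2, 2, 1, 1, 3, 3),
    "gemini-2.5-pro": (3, 3, 3, 2, 2, 11),
    "deepseek-r1": (1, 6, 7, 5, 6, 1),
    "claude-sonnet-4-5": (5, 1, 2, 4, 5, 14),
    "o3-mini": (6, 8, 6, 3, 4, 4),
    "deepseek-chat": (4, 5, 4, 7, 11, 7),
    "mistral-large": (8, 4, 5, 6, 15, 13),
    "gpt-4o": (7, 9, 12, 9, 14, 2),
    "mistral-small-3": (9, 7, 10, 8, 8, 16),
    "llama-3.3-70b": (13, 10, 13, 11, 9, 6),
    "command-r-plus": (10, 11, 14, 12, 12, 8),
    "gpt-4o-mini": (14, 12, 15, 15, 7, 5),
    "claude-haiku-3-5": (11, 14, 11, 13, 13, 9),
    "qwen-72b": (12, 16, 8, 10, 10, 15),
    "llama-3.1-8b": (15, 15, 16, 14, 1, 12),
    "gemini-2.0-flash": (16, 13, 9, 16, 16, 10),
}


def rank_range(ranks):
    """Gap between a model's worst and best rank."""
    return max(ranks) - min(ranks)


def categorize(rng, consistent_max=2, moderate_max=5):
    """Map a range to a category. The cutoffs are hypothetical."""
    if rng <= consistent_max:
        return "consistent"
    if rng <= moderate_max:
        return "moderate"
    return "swinger"


# Models that never leave the top 3 in any DAG or regime.
always_top3 = [model for model, ranks in RANKS.items() if max(ranks) <= 3]

for model, ranks in RANKS.items():
    rng = rank_range(ranks)
    print(f"{model:20s} range={rng:2d} {categorize(rng)}")
print("always top-3:", always_top3)
```

Because the range depends only on the two extreme ranks, one outlier column is enough to move a model into the swinger category: llama-3.1-8b ranks 14th or lower in four columns, yet its single first place on I03s gives it the largest range in the table.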