// MODEL LEADERBOARD
16 LLMs on Pearl's causal ladder
Models are evaluated on 3 causal DAGs (mediator, collider, confounder) across 2 regimes (static, interactive). Island_01 shows full scores; Island_02 and Island_03 are commercial worlds, so only ranks and tier badges are shown for them.
| # | Model | n | P1 | P2 | P3 | Overall | Flag |
|---|---|---|---|---|---|---|---|
| 01 | deepseek-r1 DeepSeek | 10 | 93.5 [86.5, 98.0] | 82.5 [75.5, 88.0] | 70.0 [61.0, 78.9] | 82.0 [79.3, 84.7] | |
| 02 | claude-opus-4-5 Anthropic | 10 | 91.5 [85.5, 97.0] | 76.0 [68.5, 83.0] | 65.9 [62.1, 70.2] | 77.6 [75.3, 80.0] | |
| 03 | gemini-2.5-pro Google | 7 | 72.2 [48.9, 90.0] | 79.4 [72.2, 86.1] | 68.7 [59.9, 73.4] | 75.8 [70.6, 80.8] | |
| 04 | deepseek-chat DeepSeek | 10 | 85.5 [78.0, 91.0] | 65.5 [61.0, 71.5] | 72.1 [65.4, 78.5] | 73.5 [69.4, 77.7] | |
| 05 | claude-sonnet-4-5 Anthropic | 10 | 89.5 [87.0, 92.5] | 65.0 [65.0, 65.0] | 64.7 [59.9, 69.5] | 72.2 [70.4, 74.1] | |
| 06 | o3-mini OpenAI | 10 | 79.5 [64.0, 90.5] | 73.0 [61.0, 84.0] | 59.6 [54.7, 64.3] | 70.9 [63.4, 77.9] | |
| 07 | gpt-4o OpenAI | 15 | 75.0 [69.0, 81.0] | 77.0 [69.7, 83.7] | 56.4 [54.1, 58.8] | 70.2 [66.6, 73.5] | |
| 08 | mistral-large Mistral | 10 | 74.0 [62.5, 80.5] | 67.0 [60.0, 73.5] | 66.7 [57.3, 75.2] | 69.0 [64.7, 72.4] | |
| 09 | mistral-small-3 Mistral | 10 | 71.0 [58.0, 82.0] | 71.5 [64.5, 80.0] | 63.7 [58.3, 69.2] | 69.0 [64.1, 73.8] | |
| 10 | command-r-plus Cohere | 10 | 64.0 [53.0, 74.0] | 56.0 [55.0, 58.0] | 53.2 [48.6, 58.5] | 57.5 [54.9, 60.4] | |
| 11 | claude-haiku-3-5 Anthropic | 10 | 66.0 [54.0, 75.5] | 65.0 [65.0, 65.0] | 35.4 [29.6, 42.1] | 56.4 [52.0, 60.2] | |
| 12 | qwen-72b Alibaba | 10 | 47.5 [35.5, 59.0] | 65.0 [65.0, 65.0] | 52.4 [46.5, 57.6] | 56.0 [51.8, 60.1] | |
| 13 | llama-3.3-70b Meta | 10 | 42.5 [35.0, 53.5] | 64.5 [59.0, 69.5] | 48.6 [36.1, 57.7] | 53.1 [46.1, 60.0] | |
| 14 | gpt-4o-mini OpenAI | 10 | 33.0 [30.0, 35.0] | 70.0 [59.5, 80.5] | 44.8 [38.2, 53.0] | 51.3 [46.4, 56.7] | |
| 15 | llama-3.1-8b Meta | 10 | 47.5 [40.0, 54.5] | 57.5 [55.0, 62.5] | 24.0 [19.5, 28.5] | 44.5 [41.8, 47.6] | |
| 16 | gemini-2.0-flash Google | 10 | 30.0 [27.0, 33.0] | 65.0 [65.0, 65.0] | 27.9 [18.1, 37.6] | 43.4 [39.9, 46.7] | |
// CROSS-WORLD ANALYSIS
Consistency across 3 DAGs
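How consistent are LLMs across causal structures? Range is the gap between a model's best and worst rank across the six DAG-regime columns; a lower range means a more consistent model. Only 1 of 16 models stays in the top 3 across all 3 DAGs and both regimes.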
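As a minimal sketch of how the Range and Category columns can be derived from the six rank columns: Range is the worst rank minus the best rank. The category cut-offs used here (`consistent_max=2`, `moderate_max=5`) are an assumption chosen only to agree with the rows shown on this page; the actual thresholds are not stated. The function names are hypothetical, and the three sample rows are copied from the rank table in this section.

```python
# Ranks in column order I01s, I01i, I02s, I02i, I03s, I03i
# (three sample rows copied from the consistency table).
SAMPLE_RANKS = {
    "claude-opus-4-5": [2, 2, 1, 1, 3, 3],
    "o3-mini": [6, 8, 6, 3, 4, 4],
    "llama-3.1-8b": [15, 15, 16, 14, 1, 12],
}


def rank_range(ranks):
    """Gap between the worst and best rank; lower means more consistent."""
    return max(ranks) - min(ranks)


def category(range_value, consistent_max=2, moderate_max=5):
    """Map a range to a label. The cut-offs are assumed, not documented."""
    if range_value <= consistent_max:
        return "consistent"
    if range_value <= moderate_max:
        return "moderate"
    return "swinger"


def always_top_k(ranks, k=3):
    """True if the model is ranked k or better in every column."""
    return all(r <= k for r in ranks)


for model, ranks in SAMPLE_RANKS.items():
    r = rank_range(ranks)
    print(f"{model}: range={r}, category={category(r)}, "
          f"always top-3={always_top_k(ranks)}")
```

With these assumed cut-offs, the three sample rows come out as consistent, moderate, and swinger respectively, matching their labels in the table.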
// MEAN ρ — STATIC
0.43
Kendall W 0.62 · all 16 models
// MEAN ρ — STATIC (EXCL. ARTEFACT)
0.59
Kendall W 0.74 · excl. mode collapse
// MEAN ρ — INTERACTIVE
0.36
Kendall W 0.58 · confounded by F.M10
Consistent · Moderate · Swinger
| # | Model | I01s | I01i | I02s | I02i | I03s | I03i | Range | Category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-5 ★ | 2 | 2 | 1 | 1 | 3 | 3 | 2 | consistent |
| 2 | gemini-2.5-pro | 3 | 3 | 3 | 2 | 2 | 11 | 9 | swinger |
| 3 | deepseek-r1 | 1 | 6 | 7 | 5 | 6 | 1 | 6 | swinger |
| 4 | claude-sonnet-4-5 | 5 | 1 | 2 | 4 | 5 | 14 | 13 | swinger |
| 5 | o3-mini | 6 | 8 | 6 | 3 | 4 | 4 | 5 | moderate |
| 6 | deepseek-chat | 4 | 5 | 4 | 7 | 11 | 7 | 7 | swinger |
| 7 | mistral-large | 8 | 4 | 5 | 6 | 15 | 13 | 11 | swinger |
| 8 | gpt-4o | 7 | 9 | 12 | 9 | 14 | 2 | 12 | swinger |
| 9 | mistral-small-3 | 9 | 7 | 10 | 8 | 8 | 16 | 9 | swinger |
| 10 | llama-3.3-70b | 13 | 10 | 13 | 11 | 9 | 6 | 7 | swinger |
| 11 | command-r-plus | 10 | 11 | 14 | 12 | 12 | 8 | 6 | swinger |
| 12 | gpt-4o-mini | 14 | 12 | 15 | 15 | 7 | 5 | 10 | swinger |
| 13 | claude-haiku-3-5 | 11 | 14 | 11 | 13 | 13 | 9 | 5 | moderate |
| 14 | qwen-72b | 12 | 16 | 8 | 10 | 10 | 15 | 8 | swinger |
| 15 | llama-3.1-8b | 15 | 15 | 16 | 14 | 1 | 12 | 15 | swinger |
| 16 | gemini-2.0-flash | 16 | 13 | 9 | 16 | 16 | 10 | 7 | swinger |
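The static and interactive summary figures can be cross-checked against the rank table above. The sketch below computes Kendall's W over the three DAG rankings of each regime, and assumes that "mean ρ" means the mean pairwise Spearman correlation between those rankings (the page does not say so explicitly). The function names are hypothetical. The "excl. artefact" figure is not reproduced, because the page does not list which models were excluded.

```python
from itertools import combinations

# Ranks copied from the consistency table; column order is
# I01s, I01i, I02s, I02i, I03s, I03i.
TABLE = {
    "claude-opus-4-5":   (2, 2, 1, 1, 3, 3),
    "gemini-2.5-pro":    (3, 3, 3, 2, 2, 11),
    "deepseek-r1":       (1, 6, 7, 5, 6, 1),
    "claude-sonnet-4-5": (5, 1, 2, 4, 5, 14),
    "o3-mini":           (6, 8, 6, 3, 4, 4),
    "deepseek-chat":     (4, 5, 4, 7, 11, 7),
    "mistral-large":     (8, 4, 5, 6, 15, 13),
    "gpt-4o":            (7, 9, 12, 9, 14, 2),
    "mistral-small-3":   (9, 7, 10, 8, 8, 16),
    "llama-3.3-70b":     (13, 10, 13, 11, 9, 6),
    "command-r-plus":    (10, 11, 14, 12, 12, 8),
    "gpt-4o-mini":       (14, 12, 15, 15, 7, 5),
    "claude-haiku-3-5":  (11, 14, 11, 13, 13, 9),
    "qwen-72b":          (12, 16, 8, 10, 10, 15),
    "llama-3.1-8b":      (15, 15, 16, 14, 1, 12),
    "gemini-2.0-flash":  (16, 13, 9, 16, 16, 10),
}
STATIC = (0, 2, 4)        # I01s, I02s, I03s
INTERACTIVE = (1, 3, 5)   # I01i, I02i, I03i


def columns(indices):
    """Return one ranking (list of ranks over all models) per column index."""
    return [[row[i] for row in TABLE.values()] for i in indices]


def kendall_w(rankings):
    """Kendall's coefficient of concordance for untied, complete rankings."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(per_model) for per_model in zip(*rankings)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def spearman(a, b):
    """Spearman rho for two untied rankings of the same items."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def mean_pairwise_rho(rankings):
    """Average Spearman rho over every pair of rankings."""
    pairs = list(combinations(rankings, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


for name, idx in (("static", STATIC), ("interactive", INTERACTIVE)):
    r = columns(idx)
    print(f"{name}: W={kendall_w(r):.2f}, mean rho={mean_pairwise_rho(r):.2f}")
# static: W=0.62, mean rho=0.43
# interactive: W=0.58, mean rho=0.36
```

For untied rankings the two statistics are linked by mean ρ = (mW − 1)/(m − 1), where m is the number of rankings (here 3), which is why the two figures on each card move together.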