// MODEL LEADERBOARD

16 LLMs on Pearl's causal ladder

Evaluated on 3 causal DAGs (mediator, collider, confounder) across 2 regimes (static, interactive). Island_01 shows full scores; Island_02 and Island_03 are commercial worlds, so only their ranks and tier badges are shown.


#	Model (vendor)	n	P1	P2	P3	Overall
01	deepseek-r1 DeepSeek	10	93.5 [86.5, 98.0]	82.5 [75.5, 88.0]	70.0 [61.0, 78.9]	82.0 [79.3, 84.7]
02	claude-opus-4-5 Anthropic	10	91.5 [85.5, 97.0]	76.0 [68.5, 83.0]	65.9 [62.1, 70.2]	77.6 [75.3, 80.0]
03	gemini-2.5-pro Google	7	72.2 [48.9, 90.0]	79.4 [72.2, 86.1]	68.7 [59.9, 73.4]	75.8 [70.6, 80.8]
04	deepseek-chat DeepSeek	10	85.5 [78.0, 91.0]	65.5 [61.0, 71.5]	72.1 [65.4, 78.5]	73.5 [69.4, 77.7]
05	claude-sonnet-4-5 Anthropic	10	89.5 [87.0, 92.5]	65.0 [65.0, 65.0]	64.7 [59.9, 69.5]	72.2 [70.4, 74.1]
06	o3-mini OpenAI	10	79.5 [64.0, 90.5]	73.0 [61.0, 84.0]	59.6 [54.7, 64.3]	70.9 [63.4, 77.9]
07	gpt-4o OpenAI	15	75.0 [69.0, 81.0]	77.0 [69.7, 83.7]	56.4 [54.1, 58.8]	70.2 [66.6, 73.5]
08	mistral-large Mistral	10	74.0 [62.5, 80.5]	67.0 [60.0, 73.5]	66.7 [57.3, 75.2]	69.0 [64.7, 72.4]
09	mistral-small-3 Mistral	10	71.0 [58.0, 82.0]	71.5 [64.5, 80.0]	63.7 [58.3, 69.2]	69.0 [64.1, 73.8]
10	command-r-plus Cohere	10	64.0 [53.0, 74.0]	56.0 [55.0, 58.0]	53.2 [48.6, 58.5]	57.5 [54.9, 60.4]
11	claude-haiku-3-5 Anthropic	10	66.0 [54.0, 75.5]	65.0 [65.0, 65.0]	35.4 [29.6, 42.1]	56.4 [52.0, 60.2]
12	qwen-72b Alibaba	10	47.5 [35.5, 59.0]	65.0 [65.0, 65.0]	52.4 [46.5, 57.6]	56.0 [51.8, 60.1]
13	llama-3.3-70b Meta	10	42.5 [35.0, 53.5]	64.5 [59.0, 69.5]	48.6 [36.1, 57.7]	53.1 [46.1, 60.0]
14	gpt-4o-mini OpenAI	10	33.0 [30.0, 35.0]	70.0 [59.5, 80.5]	44.8 [38.2, 53.0]	51.3 [46.4, 56.7]
15	llama-3.1-8b Meta	10	47.5 [40.0, 54.5]	57.5 [55.0, 62.5]	24.0 [19.5, 28.5]	44.5 [41.8, 47.6]
16	gemini-2.0-flash Google	10	30.0 [27.0, 33.0]	65.0 [65.0, 65.0]	27.9 [18.1, 37.6]	43.4 [39.9, 46.7]

// CROSS-WORLD ANALYSIS

Consistency across 3 DAGs

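How consistent are LLMs across causal structures? Range is the gap between a model's best and worst rank across the six DAG × regime columns, so a lower range means a more consistent model. Column codes give the island and regime: I01s is Island_01 static, I03i is Island_03 interactive. Only 1 of 16 models stays in the top 3 across all 3 DAGs and both regimes.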

// MEAN ρ — STATIC

0.43

Kendall W 0.62 · all 16 models

// MEAN ρ — STATIC (EXCL. ARTEFACT)

0.59

Kendall W 0.74 · excl. mode collapse

// MEAN ρ — INTERACTIVE

0.36

Kendall W 0.58 · confounded by F.M10
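The static and interactive figures above can be recomputed from the rank table in this section. The sketch below is a minimal illustration, not the site's own pipeline: it assumes each DAG's ranking is complete and tie-free, treats the three islands as the three "raters", and uses hypothetical helper names (`kendall_w`, `mean_pairwise_rho`, `summarize`). The ranks are copied from the table in column order I01s, I01i, I02s, I02i, I03s, I03i.

```python
from itertools import combinations

# Ranks copied from the consistency table.
# Column order: I01s, I01i, I02s, I02i, I03s, I03i
RANKS = {
    "claude-opus-4-5":   (2, 2, 1, 1, 3, 3),
    "gemini-2.5-pro":    (3, 3, 3, 2, 2, 11),
    "deepseek-r1":       (1, 6, 7, 5, 6, 1),
    "claude-sonnet-4-5": (5, 1, 2, 4, 5, 14),
    "o3-mini":           (6, 8, 6, 3, 4, 4),
    "deepseek-chat":     (4, 5, 4, 7, 11, 7),
    "mistral-large":     (8, 4, 5, 6, 15, 13),
    "gpt-4o":            (7, 9, 12, 9, 14, 2),
    "mistral-small-3":   (9, 7, 10, 8, 8, 16),
    "llama-3.3-70b":     (13, 10, 13, 11, 9, 6),
    "command-r-plus":    (10, 11, 14, 12, 12, 8),
    "gpt-4o-mini":       (14, 12, 15, 15, 7, 5),
    "claude-haiku-3-5":  (11, 14, 11, 13, 13, 9),
    "qwen-72b":          (12, 16, 8, 10, 10, 15),
    "llama-3.1-8b":      (15, 15, 16, 14, 1, 12),
    "gemini-2.0-flash":  (16, 13, 9, 16, 16, 10),
}
STATIC_COLS = (0, 2, 4)       # I01s, I02s, I03s
INTERACTIVE_COLS = (1, 3, 5)  # I01i, I02i, I03i


def kendall_w(rankings):
    """Kendall's W for m complete, tie-free rankings of the same n items."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def spearman(a, b):
    """Spearman's rho for two tie-free rankings."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def mean_pairwise_rho(rankings):
    """Average Spearman's rho over every pair of rankings."""
    pairs = list(combinations(rankings, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


def summarize(ranks, cols):
    """Return (mean rho, Kendall's W), rounded to 2 decimals, for the chosen columns."""
    rankings = [[row[c] for row in ranks.values()] for c in cols]
    return round(mean_pairwise_rho(rankings), 2), round(kendall_w(rankings), 2)


print(summarize(RANKS, STATIC_COLS))       # (0.43, 0.62)
print(summarize(RANKS, INTERACTIVE_COLS))  # (0.36, 0.58)
```

For tie-free rankings the two statistics are linked by mean ρ = (mW − 1)/(m − 1), with m = 3 DAGs here, which is why each card's pair of numbers moves together. The artefact-excluded figures are not reproduced because the page does not say which rows were dropped.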

Consistent · Moderate · Swinger

#	Model	I01s	I01i	I02s	I02i	I03s	I03i	Range	Category
1	claude-opus-4-5	2	2	1	1	3	3	2	consistent
2	gemini-2.5-pro	3	3	3	2	2	11	9	swinger
3	deepseek-r1	1	6	7	5	6	1	6	swinger
4	claude-sonnet-4-5	5	1	2	4	5	14	13	swinger
5	o3-mini	6	8	6	3	4	4	5	moderate
6	deepseek-chat	4	5	4	7	11	7	7	swinger
7	mistral-large	8	4	5	6	15	13	11	swinger
8	gpt-4o	7	9	12	9	14	2	12	swinger
9	mistral-small-3	9	7	10	8	8	16	9	swinger
10	llama-3.3-70b	13	10	13	11	9	6	7	swinger
11	command-r-plus	10	11	14	12	12	8	6	swinger
12	gpt-4o-mini	14	12	15	15	7	5	10	swinger
13	claude-haiku-3-5	11	14	11	13	13	9	5	moderate
14	qwen-72b	12	16	8	10	10	15	8	swinger
15	llama-3.1-8b	15	15	16	14	1	12	15	swinger
16	gemini-2.0-flash	16	13	9	16	16	10	7	swinger
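The Range and Category columns can be derived from the six rank columns. The sketch below shows one way to do it on four rows from the table. The cutoffs (`CONSISTENT_MAX = 2`, `MODERATE_MAX = 5`) are assumptions chosen to agree with the rows shown (range 2 is consistent, 5 is moderate, 6 and above is swinger); the page does not state the exact boundaries, and the function names are hypothetical.

```python
# Four rows copied from the table: I01s, I01i, I02s, I02i, I03s, I03i
RANKS = {
    "claude-opus-4-5": (2, 2, 1, 1, 3, 3),
    "o3-mini":         (6, 8, 6, 3, 4, 4),
    "deepseek-r1":     (1, 6, 7, 5, 6, 1),
    "llama-3.1-8b":    (15, 15, 16, 14, 1, 12),
}

# Assumed cutoffs, inferred from the table rather than documented.
CONSISTENT_MAX = 2
MODERATE_MAX = 5


def rank_range(ranks):
    """Gap between a model's worst and best rank across all columns."""
    return max(ranks) - min(ranks)


def category(ranks):
    """Map a rank range onto the three tier labels."""
    r = rank_range(ranks)
    if r <= CONSISTENT_MAX:
        return "consistent"
    if r <= MODERATE_MAX:
        return "moderate"
    return "swinger"


def always_top_k(ranks, k=3):
    """True if the model never ranks worse than k in any column."""
    return max(ranks) <= k


for model, ranks in RANKS.items():
    print(model, rank_range(ranks), category(ranks), always_top_k(ranks))
# claude-opus-4-5 2 consistent True
# o3-mini 5 moderate False
# deepseek-r1 6 swinger False
# llama-3.1-8b 15 swinger False
```

Range alone hides where the swing comes from: llama-3.1-8b's range of 15 is driven by a single rank of 1 on I03s against ranks of 12 to 16 everywhere else.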