1 model out of 16 truly understands causality
“claude-opus-4-5 is the only LLM staying top-3 across all 3 DAGs and both regimes.”
- Static ranks: 2 · 1 · 3
- Interactive ranks: 2 · 1 · 3
- World-invariant: 1 / 16
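One way to read the "World-invariant" count: a model qualifies only if it ranks in the top 3 in every world and in both regimes. Below is a minimal sketch of that check. It assumes the three ranks are listed in the order Island_01, Island_02, Island_03, which the page does not state. The `world_invariant` helper and the `model_b` row are hypothetical placeholders, not benchmark data.

```python
from typing import Dict, List

# ranks[model][regime] -> ranks on Island_01, Island_02, Island_03 (order assumed)
Ranks = Dict[str, Dict[str, List[int]]]


def world_invariant(ranks: Ranks, k: int = 3) -> List[str]:
    """Return models ranked within the top k in every world and every regime."""
    return sorted(
        model
        for model, by_regime in ranks.items()
        if all(r <= k for world_ranks in by_regime.values() for r in world_ranks)
    )


ranks: Ranks = {
    # Ranks quoted on this page.
    "claude-opus-4-5": {"static": [2, 1, 3], "interactive": [2, 1, 3]},
    # Hypothetical placeholder row, for illustration only.
    "model_b": {"static": [1, 9, 12], "interactive": [4, 2, 7]},
}

print(world_invariant(ranks))
```

A single rank outside the top k in any world or regime is enough to disqualify a model, which is why the criterion is strict.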
8 findings from our cross-world causal reasoning benchmark — including one we publicly refuted. Each card opens to its method, interpretation, and caveats. Commercial worlds (Island_02, Island_03) are reported as ranks and deltas only.
“claude-opus-4-5 is the only LLM staying top-3 across all 3 DAGs and both regimes.”
“Most LLMs change rank dramatically when the causal structure changes. They're not reasoning — they're pattern matching.”
“With only two worlds (N=2), we claimed interactive was more cross-world consistent. Adding Island_03 reversed this.”
“ρ(mediator × collider) = 0.76. Against the confounder world, ρ drops to 0.20–0.34. Confounding tests something fundamentally different.”
“The static top-5 lose 19 to 33 IQ points in interactive mode. Multi-turn reveals capacity for doubt, not capacity for calculation.”
“DeepSeek-R1 tops Island_01 (mediator) but crashes on Island_02 (collider) and Island_03 (confounder).”
“Static IQ spans 36 points. Interactive compresses that range to 11 points. The top crashes, the bottom rises.”
“llama-3.1-8b topped Island_03 static — not by reasoning, but by rounding to safe numbers that happened to fall in our tolerance windows.”
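The cross-world ρ values quoted above compare how models are ordered in one world against another. A minimal sketch of such a comparison follows. It assumes ρ denotes Spearman's rank correlation, which the page does not confirm. The function names and the toy inputs are illustrative, not the benchmark's code or data.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Toy scores for five hypothetical models in two worlds (not benchmark data).
world_a = [1, 2, 3, 4, 5]
world_b = [2, 1, 4, 3, 5]
print(round(spearman_rho(world_a, world_b), 2))
```

Under this reading, a value near 1 means two worlds order the models almost identically, while a value near 0 means a model's standing in one world says little about its standing in the other.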