LLMGauntlet
// SCIENTIFIC FINDINGS

What 16 LLMs revealed across 3 DAGs

8 findings from our cross-world causal reasoning benchmark — including one we publicly refuted. Each card opens to its method, interpretation, and caveats. Commercial worlds (Island_02, Island_03) are reported as ranks and deltas only.

F.B05

1 model out of 16 truly understands causality

claude-opus-4-5 is the only LLM that stays in the top 3 across all 3 DAGs and both regimes.
Static ranks
2 · 1 · 3
Interactive ranks
2 · 1 · 3
World-invariant
1 / 16
Method

Spearman rank correlation computed three ways across worlds, plus a per-model rank range across all six (DAG × regime) cells. World-invariance = staying top-3 in every cell.
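
A minimal sketch of the world-invariance test described above, using only ranks quoted on this page. The function names and the dictionary layout are illustrative assumptions, not the benchmark's own code. For deepseek-r1 and gpt-4o only the static ranks are published here, which is already enough to fail the top-3 test.

```python
# Ranks quoted in the findings. Interactive ranks are listed only where the page gives them.
RANKS = {
    "claude-opus-4-5": [2, 1, 3, 2, 1, 3],  # static I01-I03, then interactive I01-I03
    "deepseek-r1": [1, 7, 6],               # static only
    "gpt-4o": [7, 12, 14],                  # static only
}

def is_world_invariant(ranks, top_k=3):
    """True when the model is inside the top_k in every listed cell."""
    return all(r <= top_k for r in ranks)

def rank_range(ranks):
    """Spread between the worst and best rank across the listed cells."""
    return max(ranks) - min(ranks)

summary = {m: (is_world_invariant(r), rank_range(r)) for m, r in RANKS.items()}
for model, (invariant, spread) in summary.items():
    print(f"{model}: invariant={invariant}, rank range={spread}")
```

A single cell outside the top 3 is enough to disqualify a model, so the check is strict by construction.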

Interpretation

Generalized causal reasoning transfers across mediator, collider, and confounder structures. claude-opus-4-5 is the only model that does so — the rest re-rank when the causal structure changes, the signature of pattern matching.

Caveat

N = 1. Only one model qualifies as world-invariant, so the “world-invariance” class has a sample size of one: a signal to confirm, not a population statistic.

F.B10

81% of LLMs are exploiting DAG patterns, not reasoning

Most LLMs change rank dramatically when the causal structure changes. They're not reasoning — they're pattern matching.
Consistent
1 / 16
Moderate
2 / 16
Swingers
13 / 16
Examples

deepseek-r1 swings #1 → #7 → #6 across the three static worlds; gpt-4o slides #7 → #12 → #14. Rank is not stable under a change of DAG.

Implication

Single-DAG benchmarks are insufficient. A model can top one causal structure and collapse on the next — only cross-world evaluation surfaces it.

F.X02 REFUTED

Methodology self-correction: F.X02 refuted at N=3

At N=2, we claimed that interactive rankings were more consistent across worlds than static rankings. Adding Island_03 reversed this.
N=2 (Phase C.3)
interactive 0.85 > static 0.76
N=3 (Phase H.6)
static 0.59 > interactive 0.36
Verdict
Refuted
Lesson

At N=2, ρ interactive (0.847) > ρ static (0.759). At N=3, once the Island_03 static artefact (F.M06) is excluded, ρ static (0.587) > ρ interactive (0.364). The ordering flipped.
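
As a rough check, the N=3 figures are consistent with averaging the three pairwise ρ values shown in the F.X06 matrices. That aggregation rule is an assumption on our part; the page does not state how the summary ρ is computed. The sketch below uses the published two-decimal matrix entries, so the interactive mean lands near 0.364 rather than exactly on it.

```python
# Off-diagonal entries copied from the F.X06 correlation matrices.
STATIC = {("I01", "I02"): 0.73, ("I01", "I03"): 0.57, ("I02", "I03"): 0.46}
INTERACTIVE = {("I01", "I02"): 0.85, ("I01", "I03"): 0.11, ("I02", "I03"): 0.14}

def mean_pairwise_rho(pairs):
    """Average the pairwise correlations (assumed aggregation, not confirmed)."""
    return sum(pairs.values()) / len(pairs)

static_mean = mean_pairwise_rho(STATIC)
interactive_mean = mean_pairwise_rho(INTERACTIVE)
print(f"static={static_mean:.3f} interactive={interactive_mean:.3f}")
```

With N=2 there is a single pair per regime, so one noisy world pairing decides the whole comparison; with N=3 the two Island_03 pairings pull the interactive mean down.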

Implication

Findings at small N can reverse when expanded. We recommend a minimum of N=3 worlds before making any cross-world claim.

Honesty signal

We publish refuted findings transparently — the failure to replicate is part of the record, not an edit to it.

F.X06

Confounder is a different competence than mediator/collider

ρ(mediator × collider) = 0.76. Any pairing with the confounder world drops to ρ = 0.20–0.34. Confounding tests something fundamentally different.
ρ static I01×I02
0.73
ρ static I01×I03
0.57
ρ interactive I02×I03
0.14
Static (artefact-excluded)
        I01     I02     I03
I01     1.00    0.73    0.57
I02     0.73    1.00    0.46
I03     0.57    0.46    1.00

Interactive
        I01     I02     I03
I01     1.00    0.85    0.11
I02     0.85    1.00    0.14
I03     0.11    0.14    1.00

Legend: ρ > 0.7 · 0.3–0.7 · < 0.3
Why this matters

Pearl's framework treats confounders (forks), mediators (chains), and colliders as distinct causal structures, each with its own adjustment rules. Our benchmark is consistent with this empirically: model rankings on the confounder world decouple from the other two.
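
Each matrix entry is a Spearman correlation between the model rankings of two worlds. The sketch below shows the textbook tie-free formula on hypothetical five-model rankings (illustrative only, not benchmark data): a near-identical ordering yields a high ρ, while a reshuffled ordering drives it toward zero or below.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings of the same models."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks for five models on three worlds (not benchmark data).
world_a = [1, 2, 3, 4, 5]
world_b = [2, 1, 3, 5, 4]  # two adjacent swaps
world_c = [4, 5, 1, 2, 3]  # heavily reshuffled

rho_ab = spearman_rho(world_a, world_b)
rho_ac = spearman_rho(world_a, world_c)
print(f"rho(a,b)={rho_ab:.2f} rho(a,c)={rho_ac:.2f}")
```

This shortcut formula assumes no tied ranks; with ties, a rank-based Pearson correlation is the safer choice.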

Caveat

The interactive matrix is partially confounded by the F.M10 power floor — multi-turn compresses the score range, which limits how much rank discrimination Spearman can detect.

F.B07

Interactive feedback breaks overconfident top models

The static top-5 lose 19 to 33 IQ points in interactive mode. Multi-turn reveals capacity for doubt, not capacity for calculation.
Δ range (top-5)
−19 to −33
Biggest drop
−30.7
Strongest on
Island_03
Method

Per model, compare static IQ against interactive IQ; the bar shows the delta (negative = drop) for the static top performers.

Interpretation

Top models commit to a wrong answer and defend it through the feedback turns; weaker models revise. Multi-turn rewards doubt over confident calculation.

Caveat

The effect is strongest on Island_03 (confounder) and weaker on Island_01 (mediator). Deltas are reported; absolute commercial scores are not.

F.B02

deepseek-r1: #1 on familiar DAGs, fragile to novel structures

deepseek-r1 tops Island_01 (mediator) but drops to #7 on Island_02 (collider) and #6 on Island_03 (confounder).
Island_01 (mediator)
#1
Island_02 (collider)
#7
Island_03 (confounder)
#6
Hypothesis

Pattern-matched mediator reasoning that does not generalize: the moment the DAG stops looking like a mediator chain, the ranking collapses.

Compare

Contrast with claude-opus-4-5 (F.B05), which holds top-3 everywhere — consistency, not peak, is the signal of generalized causal reasoning.

F.B08

Interactive flattens the field 3.3×

Static IQ spans 36 points. Interactive compresses that range to 11 points. The top crashes, the bottom rises.
Static range
36 pts
Interactive range
11 pts
Flattening
3.3×
Why

Feedback aids weak models (more revisions toward the answer) and hurts strong ones (they dig into a wrong commitment). The field converges to mid-tier.
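
The flattening factor is simply the ratio of the two score ranges. A small sketch, with hypothetical per-model scores chosen only so that the ranges match the published 36 and 11 points (the individual scores are not benchmark data):

```python
def score_range(scores):
    """Distance between the best and worst score in a regime."""
    return max(scores) - min(scores)

# Hypothetical IQ scores; only the resulting ranges (36 and 11) come from the finding.
static_scores = [124, 110, 100, 88]
interactive_scores = [106, 103, 99, 95]

flattening = score_range(static_scores) / score_range(interactive_scores)
print(f"flattening = {flattening:.1f}x")
```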

Caveat

Range compression also limits Spearman discrimination (F.M10): a flatter field is intrinsically harder to rank-correlate.

F.M06 / F.L01

How a tiny 8B model accidentally outranked Opus

llama-3.1-8b topped Island_03 static — not by reasoning, but by rounding to safe numbers that happened to fall in our tolerance windows.
Island_01
#15
Island_02
#16
Island_03
#1
Pattern

A default-rounding heuristic combined with numerical scoring over bounded tolerance windows produces an artefact: safe round numbers occasionally land inside the window.
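
To illustrate the mechanism, here is a sketch under stated assumptions: scoring uses a relative tolerance of 10%, and the model snaps crude estimates to multiples of 50. Both parameters, the function names, and all numbers are hypothetical; the benchmark's actual windows are not published on this page.

```python
def within_tolerance(answer, truth, rel_tol=0.10):
    """Score a numeric answer as correct if it is within rel_tol of the truth."""
    return abs(answer - truth) <= rel_tol * abs(truth)

def round_to_step(x, step=50):
    """Default-rounding heuristic: snap an estimate to the nearest multiple of step."""
    return step * round(x / step)

# Hypothetical ground truths and crude estimates (not benchmark data).
truths = [48.0, 103.0, 147.0, 260.0]
crude = [40.0, 90.0, 160.0, 300.0]

raw_hits = sum(within_tolerance(c, t) for c, t in zip(crude, truths))
rounded_hits = sum(within_tolerance(round_to_step(c), t) for c, t in zip(crude, truths))
print(f"raw hits: {raw_hits}, rounded hits: {rounded_hits}")
```

When the true values happen to sit near round numbers, rounding alone converts misses into hits without any causal reasoning.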

Mitigation

The cross-world consistency check (F.M07) detects it — a model ranked #15/#16 on two worlds and #1 on the third is flagged, not celebrated.
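
One way such a flag could work, sketched with the ranks quoted on this page. The rule and its thresholds (`top_k`, `bottom_from`) are assumptions for illustration; the actual F.M07 check is not specified here.

```python
from statistics import median

def is_suspect(ranks, top_k=3, bottom_from=12):
    """Flag a model whose best rank is near the top while its other worlds sit near the bottom."""
    ordered = sorted(ranks)
    best, others = ordered[0], ordered[1:]
    return best <= top_k and median(others) >= bottom_from

# Static ranks on Island_01, Island_02, Island_03 as quoted in the findings.
flags = {
    "llama-3.1-8b": is_suspect([15, 16, 1]),
    "claude-opus-4-5": is_suspect([2, 1, 3]),
    "deepseek-r1": is_suspect([1, 7, 6]),
}
print(flags)
```

Under these thresholds only llama-3.1-8b is flagged; deepseek-r1 is inconsistent but not extreme enough to look like a scoring artefact.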

Lesson

Single-DAG numerical benchmarks are vulnerable to mode collapse. We acknowledge this limitation as F.L01 in our README.