Test if your LLM truly reasons — or merely pattern-matches
Public proof.
Private evaluation.
Methodology proven publicly
We validate every approach on a public benchmark across 16 LLMs and 3 causal DAGs. The findings are open. The methodology has been stress-tested at scale.
View public findings
Built for your vertical
Custom worlds — mediator, collider, confounder — designed for YOUR domain. Finance, healthcare, legal, public sector. The same rigor, applied to your specific causal structures.
See offerings
Detects what generic benchmarks miss
Pattern matching looks fluent until you cross-test it. Our anti-pattern-matching framework catches mode collapse and structural fragility — flaws that single-DAG benchmarks miss entirely.
Read methodology
From your DAG to your evaluation report
Define your DAG
You bring the domain. We map the causal relationships that matter — confounders, mediators, colliders.
Build your worlds
We construct 3 custom evaluation worlds calibrated to your structure. Same Pearl framework, your specific use case.
Evaluate your models
Bootstrap confidence intervals. Cross-world consistency. Anti-pattern-matching checks. The same methodology as our public benchmark.
Get the report
Per-world scores. Failure modes identified. Comparison vs public baseline. Actionable insights for your AI team.
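To make the "Evaluate your models" step concrete, here is a minimal sketch of a percentile bootstrap confidence interval and a simple cross-world spread. It is an illustration only, not the benchmark's actual implementation: the function names `bootstrap_ci` and `cross_world_spread`, the 0/1 scoring scheme, and the sample scores are all assumptions made up for this example.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `scores` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def cross_world_spread(world_scores):
    """Max minus min of per-world mean scores; smaller means more consistent."""
    means = [statistics.fmean(s) for s in world_scores.values()]
    return max(means) - min(means)


# Invented per-item scores (1 = correct, 0 = wrong), for illustration only.
worlds = {
    "mediator": [1, 1, 0, 1, 1, 0, 1, 1],
    "collider": [1, 0, 0, 1, 0, 1, 0, 1],
    "confounder": [1, 1, 1, 0, 1, 1, 0, 1],
}

for name, scores in worlds.items():
    lo, hi = bootstrap_ci(scores)
    mean = statistics.fmean(scores)
    print(f"{name}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")

print(f"cross-world spread: {cross_world_spread(worlds):.2f}")
```

A model that scores well in one world but shows a large spread across worlds is the kind of case a single-DAG benchmark would not surface.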
What we already revealed about 16 production LLMs
These findings emerged from our public benchmark. The same methodology applies to your stack.
“1 model out of 16 truly understands causality.”
Only claude-opus-4-5 stays in the top 3 across all 3 DAGs and both regimes (ranks 2, 1, 3 static; 2, 1, 3 interactive). 13 of the 16 models swing dramatically in rank.
Read more: F.B05 (WORLD-INVARIANT)
“Interactive feedback reveals overconfident top models.”
The static top 5 lose 19 to 33 IQ points in the interactive regime. gemini-2.5-pro: -30.7. claude-opus-4-5: -23.7. o3-mini: -22.1.
Read more: F.B07 (TOP-TIER CRASH)
“Multi-turn flattens the field 3.3x.”
Static IQ spans 36 points across 16 models. Interactive compresses that range to 11 points — a 3.3x flattening. The top crashes, the bottom rises, and everyone converges to mid-tier performance.
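The 3.3x figure follows directly from the two spans quoted above. A small sketch of the arithmetic, where `flattening_ratio` is a hypothetical helper name used only for this illustration:

```python
def flattening_ratio(static_span, interactive_span):
    """Ratio of the static score range to the interactive score range."""
    return static_span / interactive_span


# Spans taken from the finding above: 36 points static, 11 points interactive.
ratio = flattening_ratio(36, 11)
print(f"{ratio:.2f}")  # 3.27, which rounds to the quoted 3.3x
```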
Read more: F.B08 (CONVERGENCE)
The same rigor, calibrated to your domain.
Get a causal evaluation built for your domain
We have bandwidth for 2-3 pilot engagements this quarter. Custom Worlds start at €25K.