// CAUSAL EVALUATION FOR PRODUCTION LLMS

Test if your LLM truly reasons — or merely pattern-matches

Today's LLMs are statistical. We test if they're causal. Run on 16 leading models. Now applied to your domain.

Active discussions with European financial institutions

// WHY LLMGAUNTLET

Public proof.
Private evaluation.

Methodology proven publicly

We validate every approach on a public benchmark across 16 LLMs and 3 causal DAGs. The findings are open. The methodology has been stress-tested at scale.

View public findings

Built for your vertical

Custom worlds — mediator, collider, confounder — designed for YOUR domain. Finance, healthcare, legal, public sector. The same rigor, applied to your specific causal structures.

See offerings

Detects what generic benchmarks miss

Pattern matching looks fluent until you cross-test it. Our anti-pattern-matching framework catches mode collapse and structural fragility — flaws that single-DAG benchmarks miss entirely.

Read methodology

// HOW IT WORKS

From your DAG to your evaluation report

Define your DAG

You bring the domain. We map the causal relationships that matter — confounders, mediators, colliders.
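As a rough illustration of the three structures named above, here is a minimal sketch that labels the middle variable of a triple as a mediator, confounder, or collider. The edge list, the variable names, and the `classify` helper are all hypothetical and are not taken from the benchmark itself.

```python
# Toy causal graph as a set of (cause, effect) edges.
# Every variable name below is a made-up finance example, not benchmark data.
EDGES = {
    ("rate_hike", "credit_spread"), ("credit_spread", "default_risk"),  # chain
    ("recession", "unemployment"), ("recession", "loan_losses"),        # fork
    ("fraud", "account_freeze"), ("system_error", "account_freeze"),    # collision
}


def classify(edges, x, z, y):
    """Return the role of z relative to x and y, based only on edge directions."""
    if (x, z) in edges and (z, y) in edges:
        return "mediator"    # x -> z -> y
    if (z, x) in edges and (z, y) in edges:
        return "confounder"  # x <- z -> y
    if (x, z) in edges and (y, z) in edges:
        return "collider"    # x -> z <- y
    return "none"


print(classify(EDGES, "rate_hike", "credit_spread", "default_risk"))
print(classify(EDGES, "unemployment", "recession", "loan_losses"))
print(classify(EDGES, "fraud", "account_freeze", "system_error"))
```

The three calls cover one chain, one fork, and one collision, which is the smallest graph that exercises all three roles.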

Build your worlds

We construct 3 custom evaluation worlds calibrated to your structure. Same Pearl framework, your specific use case.

Evaluate your models

Bootstrap CI. Cross-world consistency. Anti-pattern-matching checks. Same methodology as our public benchmark.
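For readers unfamiliar with the term, a bootstrap confidence interval can be sketched in a few lines. This is a generic percentile bootstrap over per-item scores, assuming scores are plain floats; the function name, the resample count, and the sample scores are illustrative assumptions, not the benchmark's actual implementation.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-item scores for one model in one world.
scores = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74, 0.59, 0.68]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A cross-world consistency check could then compare these intervals across worlds: intervals that barely overlap suggest the model's score depends on the causal structure, not only on the task.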

Get the report

Per-world scores. Failure modes identified. Comparison vs public baseline. Actionable insights for your AI team.

// METHODOLOGY VALIDATED

What we already revealed about 16 production LLMs

These findings emerged from our public benchmark. The same methodology applies to your stack.

F.B05 — WORLD-INVARIANT

“1 model out of 16 truly understands causality.”

Only claude-opus-4-5 stays in the top 3 across all 3 DAGs and both regimes (ranks 2, 1, 3 static; 2, 1, 3 interactive). 13 of the 16 models swing dramatically between worlds.
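The top-3 criterion above can be written as a one-line check. This is a sketch of how such a test might look, assuming ranks are stored per regime as one list entry per DAG; the `is_world_invariant` name and the second model's ranks are hypothetical, while the claude-opus-4-5 ranks are the ones quoted above.

```python
def is_world_invariant(ranks_by_regime, k=3):
    """True if the model ranks in the top k in every world of every regime."""
    return all(rank <= k for ranks in ranks_by_regime.values() for rank in ranks)


# Ranks quoted in the finding above (one entry per DAG).
claude_opus = {"static": [2, 1, 3], "interactive": [2, 1, 3]}
# Hypothetical model whose rank swings between worlds (illustrative only).
swinging_model = {"static": [1, 9, 4], "interactive": [6, 2, 11]}

print(is_world_invariant(claude_opus))
print(is_world_invariant(swinging_model))
```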

F.B07 — TOP-TIER CRASH

“Interactive feedback reveals overconfident top models.”

The static top 5 lose 19 to 33 IQ points in the interactive regime. gemini-2.5-pro: -30.7. claude-opus-4-5: -23.7. o3-mini: -22.1.

F.B08 — CONVERGENCE

“Multi-turn flattens the field 3.3x.”

Static IQ spans 36 points across 16 models. Interactive compresses that range to 11 points — a 3.3x flattening. The top crashes, the bottom rises, and everyone converges to mid-tier performance.
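The 3.3x figure follows directly from the two ranges quoted above. A small sketch of the arithmetic, with `flattening_factor` as a hypothetical helper name:

```python
def flattening_factor(static_range, interactive_range):
    """Ratio of the static score range to the interactive score range."""
    if interactive_range <= 0:
        raise ValueError("interactive_range must be positive")
    return static_range / interactive_range


factor = flattening_factor(36, 11)  # ranges quoted in the finding above
print(f"{factor:.1f}x")  # prints "3.3x"
```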

Apply this to your stack

The same rigor, calibrated to your domain.

// READY TO TEST YOUR STACK?

Get a causal evaluation built for your domain

We have bandwidth for 2-3 pilot engagements this quarter. Custom Worlds start at €25K.

Book a call

View public benchmark