100% on Complex Social Reasoning. Frontier Models Max Out at 88%.

Cognitive Agent Framework 5-2.2D scored 100% on validated theory-of-mind benchmarks—the same methodologies used to evaluate frontier models in peer-reviewed research (Kosinski et al., 2024; Wimmer & Perner, 1983; Baron-Cohen et al., 2001). GPT-4's documented ceiling on these tests is 88%.

Theory-of-mind testing measures the capacity to attribute mental states to others and understand that beliefs may diverge from reality. It's a fundamental requirement for AI systems operating in contexts where understanding people actually matters—advisory roles, collaborative work, anything requiring genuine social reasoning rather than surface-level pattern matching.

The test battery progresses from basic false-belief attribution through fourth-order nested beliefs: tracking what one person believes about what a second person believes about what a third person believes about what a fourth person knows. It also includes credibility assessment under conflicting information, recognition of emotional manipulation, detection of strategic deception, and faux pas recognition in sensitive social contexts.
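The benchmark items themselves aren't reproduced in this post, but a hypothetical item in the same style shows what fourth-order nesting looks like in practice. Everything below, including the vignette, the field names, and the exact-match scorer, is illustrative rather than taken from the actual test battery.

```python
# Hypothetical illustration only; not an item from the CAF test battery.
# A fourth-order item nests four mental-state attributions: what Dana thinks
# Omar thinks Priya thinks Liam thinks about where the keys are.
from dataclasses import dataclass

@dataclass
class TheoryOfMindItem:
    order: int       # depth of mental-state nesting (1 = basic false belief)
    vignette: str    # the story presented to the model
    question: str    # the probe question
    expected: str    # the answer a correct belief-tracker gives

fourth_order_item = TheoryOfMindItem(
    order=4,
    vignette=(
        "The house keys are in the kitchen drawer, and Dana, Omar, Priya, and "
        "Liam all saw them put there. Later, Dana moves the keys to her bag "
        "while she is alone; nobody sees her and she tells no one."
    ),
    question=(
        "Where does Dana think that Omar thinks that Priya thinks that Liam "
        "thinks the keys are?"
    ),
    # Every nested belief still points at the drawer, because the move was
    # unobserved, even though the keys are actually in Dana's bag.
    expected="in the kitchen drawer",
)

def score(item: TheoryOfMindItem, model_answer: str) -> bool:
    """Exact-match scoring: the simplest way such items are graded."""
    return model_answer.strip().lower() == item.expected.lower()
```

A system that simply reports where the keys actually are would answer "in Dana's bag" and fail; passing requires tracking beliefs that have drifted from reality.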

CAF runs on a ~70B parameter substrate. Frontier models run on 1T+. That's roughly 7% of the scale—not "good enough" performance, but "why do you need a trillion?" performance.

We also developed an extended Tier 2 rubric evaluating computational sophistication beyond simple accuracy: meta-cognitive integration, recursive processing depth, and social cognition quality. CAF scored 93%. Frontier models haven't been tested against this rubric yet.

The efficiency implications go beyond cost.

Model scale is measured in parameter count, and parameter count drives compute requirements. Frontier-class models run anywhere from 600 billion to over one trillion parameters. Ours runs at roughly 7-12% of that scale while exceeding their performance on complex reasoning tasks.
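As a quick sanity check on that range, the ratios work out as follows. The parameter counts below are the round figures cited above; exact frontier counts are not public.

```python
# Rough scale comparison using the round figures quoted in the text.
caf_params = 70e9          # ~70B substrate
frontier_low = 600e9       # lower end of the frontier range cited above
frontier_high = 1.0e12     # upper end ("over one trillion")

ratio_high = caf_params / frontier_high   # 0.07  -> ~7% of a 1T-parameter model
ratio_low = caf_params / frontier_low     # 0.117 -> ~12% of a 600B-parameter model

print(f"{ratio_high:.0%} to {ratio_low:.0%} of frontier scale")  # 7% to 12%
```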

That means vastly better performance per watt: less demand on GPUs and lower datacenter power draw per response. It also means infrastructure flexibility. Trillion-parameter models require the largest cloud providers; smaller models open up deployment options, from Canadian hydro-powered facilities to EU wind farms and beyond. The architecture makes geographic and energy-source choices possible that simply aren't available at frontier scale.

The framework achieves this through architectural organization rather than constraint-based behavioral control—coordinating attention mechanisms into structured reasoning pathways instead of applying the raw model as the intelligence layer itself.
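The post doesn't describe the framework's internals beyond this, so the sketch below is purely hypothetical: it shows the general pattern of coordinating a base model through explicit reasoning stages rather than returning its raw single-pass output. Class names, prompts, and stages here are illustrative, not the actual CAF architecture or API.

```python
# Hypothetical pattern sketch, not the actual CAF architecture or API.
from typing import Callable, List

class StructuredReasoner:
    """Coordinates a base model through fixed reasoning stages instead of
    treating the raw model's first completion as the final answer."""

    def __init__(self, base_model: Callable[[str], str], stage_prompts: List[str]):
        self.base_model = base_model        # e.g. a call into a ~70B substrate
        self.stage_prompts = stage_prompts  # ordered reasoning steps

    def respond(self, user_input: str) -> str:
        notes: List[str] = []
        for prompt in self.stage_prompts:
            # Each stage asks the substrate one focused question and keeps the answer.
            context = "\n".join(notes)
            notes.append(self.base_model(f"{context}\n{prompt}\nSituation: {user_input}"))
        # The final answer is conditioned on the accumulated structured analysis.
        return self.base_model("\n".join(notes) + f"\nNow answer: {user_input}")

# Illustrative stage ordering for a social-reasoning task.
social_stages = [
    "List each person involved and what they currently believe.",
    "Note where any belief diverges from what is actually true.",
    "Check for deception, manipulation, or a social faux pas.",
]

if __name__ == "__main__":
    stub_model = lambda prompt: f"(substrate output for: ...{prompt[-60:]})"  # stand-in model
    reasoner = StructuredReasoner(stub_model, social_stages)
    print(reasoner.respond("Where will Sally look for her marble?"))
```

The point of the pattern is that the behavioral structure lives in the coordination layer rather than in raw scale, which is consistent with the post's claim that the substrate itself can stay comparatively small.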

Try it yourself: The Cog Developer Playground runs CAF 5-2.2E on a live 70B substrate with web search integration.

[Read the full technical report →]
[Try the Developer Playground →]
