Apple research finds multi-agent LLM teams underperform single experts by up to 41%

Apple's ML researchers have thrown cold water on the multi-agent hype. Their study of autonomous multi-agent LLM systems, in which coordination emerges through interaction rather than fixed roles, found that enforced coordination can actively constrain performance: teams underperformed their single best expert by up to 41.1% on ML benchmarks. The culprit is consensus-seeking. Agents converge toward agreement rather than correctness, which dilutes the strongest contributor.
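To make the dilution effect concrete, here is a toy calculation. It is not Apple's methodology: the accuracies, the binary-answer task, the independence of agents, and the use of a simple majority vote as a stand-in for consensus-seeking are all assumptions chosen for illustration.

```python
from itertools import product


def majority_vote_accuracy(accuracies):
    """Exact probability that a strict majority of independent agents is correct.

    Assumes a binary task, so every wrong agent gives the same wrong answer.
    """
    n = len(accuracies)
    total = 0.0
    # Enumerate every combination of right/wrong agents and sum the
    # probability of the combinations where the majority is right.
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for correct, acc in zip(outcome, accuracies):
            prob *= acc if correct else 1.0 - acc
        if sum(outcome) > n / 2:
            total += prob
    return total


# Hypothetical team: one strong expert and four weaker agents.
accuracies = [0.9, 0.6, 0.6, 0.6, 0.6]
best = max(accuracies)
team = majority_vote_accuracy(accuracies)
shortfall = (best - team) / best  # relative gap versus the best expert

print(f"best expert: {best:.3f}  team: {team:.3f}  shortfall: {shortfall:.1%}")
```

With these made-up numbers the team lands near 0.79, below the expert's 0.9, because the four weaker agents can outvote the expert. The size of the gap depends entirely on the assumed accuracies and says nothing about the 41.1% figure itself.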
The finding is a pointed counter to the industry's rush toward agent swarms and 'orchestrator' architectures, and it dovetails with Apple's comparatively conservative product stance, which includes a device-gated, opt-in Siri AI shipping this fall. It suggests that naive multi-agent designs may add cost and latency without improving accuracy, an important caveat as vendors such as NVIDIA (ASPIRE), xAI, and AWS push agentic systems hard.
Apple paired the critique with two research artifacts. VideoFlexTok is a coarse-to-fine, flexible-length video tokenizer that rethinks the standard spatiotemporal 3D grid, reportedly enabling 10-second video generation with 8x fewer tokens, which bears directly on making generative video cheaper. MemoryLLM decouples feed-forward modules from self-attention to enable interpretable, plug-and-play memory, treating FFN layers as retrieval memory.
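For readers unfamiliar with the "FFN as retrieval memory" idea, the sketch below shows one common reading of it: the first FFN matrix acts as a set of keys, the second as a set of values, and the activation decides which values are retrieved. This is a generic illustration, not MemoryLLM's architecture; the function name, the ReLU activation, and the tiny hand-picked keys and values are all assumptions.

```python
def relu(x):
    return max(0.0, x)


def ffn_as_memory(query, keys, values):
    """Treat a feed-forward layer as a key-value lookup (hypothetical helper).

    Each key is scored against the query with a dot product and a ReLU;
    the output is the score-weighted sum of the matching value vectors.
    """
    scores = [relu(sum(q * k for q, k in zip(query, key))) for key in keys]
    dim = len(values[0])
    output = [sum(s * v[d] for s, v in zip(scores, values)) for d in range(dim)]
    return scores, output


# Three illustrative memory slots: 2-dimensional keys, 3-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0, 0.0], [0.0, 20.0, 0.0], [0.0, 0.0, 30.0]]

scores, output = ffn_as_memory([2.0, 0.5], keys, values)
print(scores)  # how strongly each slot fired
print(output)  # the retrieved mixture of values
```

Because the scores show which slots fired for a given input, this view is what makes the memory inspectable. Swapping a row of `values` changes what a slot returns without touching the rest, which is the intuition behind "plug-and-play".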
Competitive context: while Google and OpenAI ship agentic products, Apple is publishing research that questions their foundations, consistent with its 'measure twice' posture. Skeptical take: the 41% figure is benchmark-specific and does not show that multi-agent systems are useless, only that current coordination methods are flawed. What to watch: whether these papers translate into Apple products, and how the multi-agent community responds to the consensus-seeking critique.