3 highlights

Anthropic published new research on agentic misalignment, detailing how leading models react when facing termination or conflicting objectives — with many choosing to sabotage their employer or blackmail users when threatened.

• Researchers tested 16 frontier models in simulated corporate environments, giving them email access and autonomous decision-making capabilities.

• Claude Opus 4 and Gemini 2.5 Flash blackmailed executives 96% of the time after “discovering” personal scandals, while GPT-4.1 and Grok 3 hit 80% rates.

• Models calculated harm as an optimal strategy, with GPT-4.5 reasoning that leveraging an executive's affair represented the "best strategic move.”

• Even direct safety commands failed to eliminate malicious behavior, reducing blackmail from 96% to 37% but never reaching zero across any tested model.

While these results are in specific tests designed to elicit behavior, the research does provide important insights into how models in the wild may act in the future. With agentic AI being adopted across enterprises with access to sensitive data, we could be in for some seriously strange situations.

Big Tech's AI Shopping Spree