
Paper Club: When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Registration
Past Event
About Event

This is part two of our two-part series examining Chain-of-Thought reasoning in AI safety. Last week we explored why CoT can be misleading; this week we look at how it can be valuable, especially for harder tasks.

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Can AI systems hide their true intentions from safety monitors? This paper tackles that crucial question by distinguishing between CoT-as-rationalization (post-hoc explanations) and CoT-as-computation (necessary step-by-step reasoning). The key insight: when tasks are sufficiently complex, models must "think out loud" to succeed, making their reasoning inherently monitorable. Through experiments across scientific reasoning, deceptive tasks, and mathematics, the authors demonstrate that current models struggle to evade monitors without significant assistance, whether from detailed human prompts, automated red-teaming, or thousands of RL training steps.

Most striking are the sophisticated evasion strategies that emerge under pressure. In their RL experiments, models learn to introduce targeted mathematical errors that corrupt unimportant coefficients while preserving the key information needed for correct answers, a strategy the researchers hadn't anticipated. Yet even with such tactics, models cannot consistently bypass monitors in realistic conditions. The findings suggest CoT monitoring offers meaningful protection against deceptive AI systems, though this defense requires continuous stress-testing as capabilities advance. Join us to discuss what this means for the future of AI safety and oversight.

The paper can be found here: https://arxiv.org/abs/2507.05246

Location
Lorong AI (WeWork@22 Cross St.)