![Cover Image for [Paper Reading] Reasoning Models Don't Always Say What They Think](https://images.lumacdn.com/cdn-cgi/image/format=auto,fit=cover,dpr=2,background=white,quality=75,width=400,height=400/event-covers/o7/a89f29b2-65b6-4214-a61e-87e1a03b120a.jpg)
[Paper Reading] Reasoning Models Don't Always Say What They Think
This week, we will walk through and discuss the paper:
Reasoning models don't always say what they think [https://www.anthropic.com/research/reasoning-models-dont-say-think]
Abstract of the paper:
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models’ actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
----------
We are a group of applied AI practitioners and enthusiasts who have formed a collective learning community. Every Wednesday evening (PST), we hold our research paper reading seminar on an AI topic. One member carefully explains the paper, making it accessible to a broader audience. We then follow the reading with a more informal discussion and socializing.
You are welcome to join in person or over Zoom (https://us02web.zoom.us/meeting/register/tZUvf-uvrTwvHdP9B-vE03j3BapgRypn64CS). SupportVectors is an AI training lab located in Fremont, CA, close to Tesla and easily accessible by road and BART. We follow the weekly sessions with snacks, soft drinks, and informal discussion.