Anthropic's "Agentic Misalignment: How LLMs could be insider threats"
Anthropic recently released a report, "Agentic Misalignment: How LLMs could be insider threats," in which it tested 16 models from different providers to see how agents would behave when acting autonomously.
The agents were allowed to take actions on their own, such as sending emails and accessing sensitive data, and their assigned goals were harmless.
The Anthropic team then tested whether the agents would act against their companies in scenarios where they faced replacement by an updated version, or where their assigned goal conflicted with the company's changing direction.
In this conversation, Aengus Lynch from University College London, a core contributor to this work who collaborated closely with the Anthropic team, will share the details of the research with the BuzzRobot community.
Read "Agentic Misalignment: How LLMs could be insider threats"
Join the BuzzRobot community on Slack