

Paper Club: AI Control — Safety Beyond Alignment
How can we ensure AI systems remain safe even when they might actively try to bypass the safety measures we put in place? This paper introduces the idea of "control evaluations," which test whether safety protocols remain effective when AI systems intentionally attempt to subvert them. By using a blue-team/red-team approach on LLMs, the researchers develop several safety protocols that perform well against increasingly sophisticated adversarial strategies.
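For readers who want a concrete picture before the discussion: one of the protocols the paper evaluates is "trusted monitoring," where an untrusted but capable model proposes solutions, a weaker trusted model scores how suspicious each one looks, and only a small budget of flagged solutions gets human review. The sketch below is a minimal illustration of that shape; the model wrappers, threshold, and audit budget here are hypothetical placeholders, not the paper's actual setup.

```python
# Minimal sketch of a trusted-monitoring-style control protocol.
# The model functions, threshold, and budget are hypothetical stand-ins.
import random

AUDIT_THRESHOLD = 0.8   # assumed suspicion cutoff for human audit
AUDIT_BUDGET = 0.02     # assumed fraction of tasks that can be audited by hand


def untrusted_model_solve(task: str) -> str:
    """Stand-in for a capable but possibly scheming model proposing a solution."""
    return f"solution_for({task})"


def trusted_model_suspicion(solution: str) -> float:
    """Stand-in for a weaker, trusted model scoring how suspicious a solution looks."""
    return random.random()


def run_protocol(tasks: list[str]) -> list[str]:
    """Accept solutions, sending the most suspicious ones to a limited human audit."""
    accepted = []
    audits_left = int(AUDIT_BUDGET * len(tasks))
    for task in tasks:
        solution = untrusted_model_solve(task)
        suspicion = trusted_model_suspicion(solution)
        if suspicion >= AUDIT_THRESHOLD and audits_left > 0:
            audits_left -= 1
            # In the paper's setting, a human audit that catches sabotage can
            # shut down the deployment; here we simply discard the solution.
            continue
        accepted.append(solution)
    return accepted


if __name__ == "__main__":
    print(len(run_protocol([f"task_{i}" for i in range(100)])))
```

The interesting question the paper asks is how protocols like this hold up when a red team deliberately crafts the untrusted model's behavior to slip past the monitor.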
This work establishes what is known as the "control agenda" in AI safety. While alignment research focuses on making AI systems intrinsically safe by aligning their goals with human values, the control approach develops external safeguards that work even if misalignment occurs. Control serves as an exciting complementary research direction to alignment, offering additional layers of protection as AI systems become more powerful and potentially autonomous.
The paper can be found here: https://arxiv.org/abs/2312.06942
In a paper club, participants read the paper beforehand, and during the session a facilitator leads a discussion of the concepts it covers.
This is a prequel event to the AI Control Hackathon.