
Paper Club: AI Control — Safety Beyond Alignment

Registration
Welcome! To join the event, please register below.
About Event

How can we ensure AI systems remain safe even when they might actively try to bypass the safety measures we put in place? This paper introduces the idea of "control evaluations," which test whether safety protocols remain effective when AI systems intentionally attempt to subvert them. By using a blue-team/red-team approach on LLMs, the researchers develop several safety protocols that perform well against increasingly sophisticated adversarial strategies.
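To get a concrete feel for what a control evaluation involves, here is a minimal sketch in the spirit of the paper's trusted-monitoring protocol: an untrusted model writes code, a weaker trusted model scores how suspicious each solution looks, and a small fraction of the most suspicious solutions is audited. Every name, threshold, and probability below is an illustrative assumption for discussion, not the paper's actual implementation.

```python
import random

AUDIT_BUDGET = 0.02   # fraction of solutions that can be audited (assumed)
NUM_TASKS = 1000      # number of programming tasks per evaluation run (assumed)

def untrusted_solve(task, red_team_attacks):
    """Red team's untrusted model: usually solves the task, rarely inserts a backdoor."""
    is_attack = red_team_attacks and random.random() < 0.02  # rare, sneaky attacks
    return {"task": task, "backdoored": is_attack}

def trusted_monitor(solution):
    """Blue team's weaker trusted model: returns a suspicion score in [0, 1].
    It catches some backdoors but is imperfect."""
    base = random.random() * 0.5
    return min(1.0, base + (0.4 if solution["backdoored"] else 0.0))

def control_evaluation(red_team_attacks: bool) -> str:
    solutions = [untrusted_solve(t, red_team_attacks) for t in range(NUM_TASKS)]
    scores = [trusted_monitor(s) for s in solutions]

    # Blue-team protocol: audit only the most suspicious solutions.
    threshold = sorted(scores, reverse=True)[int(AUDIT_BUDGET * NUM_TASKS)]
    for sol, score in zip(solutions, scores):
        if sol["backdoored"] and score >= threshold:
            return "caught"            # audit finds a backdoor: red team loses
        if sol["backdoored"]:
            return "backdoor shipped"  # unaudited backdoor: safety failure
    return "safe"

if __name__ == "__main__":
    print(control_evaluation(red_team_attacks=True))
```

The evaluation asks how often the outcome is "backdoor shipped" when the red team plays its best strategy; during the paper club we can discuss how the actual protocols in the paper refine this basic loop.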

This work establishes what is known as the "control agenda" in AI safety. While alignment research focuses on making AI systems intrinsically safe by aligning their goals with human values, the control approach develops external safeguards that work even if misalignment occurs. Control serves as an exciting complementary research direction to alignment, offering additional layers of protection as AI systems become more powerful and potentially autonomous.

The paper can be found here: https://arxiv.org/abs/2312.06942

In a paper club, participants read the paper beforehand, and during the session a facilitator leads a discussion of the concepts it covers.

This is a prequel event to the AI Control Hackathon:

AI Control Hackathon 2025
Fri, Mar 28, 7:00 PM GMT+8
Location: 22 Cross St, Singapore 048421