
Paper Club: AI Control — Safety Beyond Alignment

Registration
Welcome! To join the event, please register below.
About Event

How can we ensure AI systems remain safe even when they might actively try to bypass the safety measures we put in place? This paper introduces the idea of "control evaluations," which test whether safety protocols remain effective when AI systems intentionally attempt to subvert them. By using a blue-team/red-team approach on LLMs, the researchers develop several safety protocols that perform well against increasingly sophisticated adversarial strategies.
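To get a concrete feel for what a control evaluation involves, here is a minimal sketch in the spirit of the paper's trusted-monitoring protocol: an untrusted model writes code, a weaker trusted model scores how suspicious each solution looks, and a small fraction of the most suspicious solutions is audited. Every name, threshold, and probability below is an illustrative assumption for discussion, not the paper's actual implementation.

```python
import random

AUDIT_BUDGET = 0.02   # fraction of solutions that can be audited (assumed)
NUM_TASKS = 1000      # number of programming tasks per evaluation run (assumed)

def untrusted_solve(task, red_team_attacks):
    """Red team's untrusted model: usually solves the task, rarely inserts a backdoor."""
    is_attack = red_team_attacks and random.random() < 0.02  # rare, sneaky attacks
    return {"task": task, "backdoored": is_attack}

def trusted_monitor(solution):
    """Blue team's weaker trusted model: returns a suspicion score in [0, 1].
    It catches some backdoors but is imperfect."""
    base = random.random() * 0.5
    return min(1.0, base + (0.4 if solution["backdoored"] else 0.0))

def control_evaluation(red_team_attacks: bool) -> str:
    solutions = [untrusted_solve(t, red_team_attacks) for t in range(NUM_TASKS)]
    scores = [trusted_monitor(s) for s in solutions]

    # Blue-team protocol: audit only the most suspicious solutions.
    threshold = sorted(scores, reverse=True)[int(AUDIT_BUDGET * NUM_TASKS)]
    for sol, score in zip(solutions, scores):
        if sol["backdoored"] and score >= threshold:
            return "caught"            # audit finds a backdoor: red team loses
        if sol["backdoored"]:
            return "backdoor shipped"  # unaudited backdoor: safety failure
    return "safe"

if __name__ == "__main__":
    print(control_evaluation(red_team_attacks=True))
```

The evaluation asks how often the outcome is "backdoor shipped" when the red team plays its best strategy; during the paper club we can discuss how the actual protocols in the paper refine this basic loop.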

This work establishes what is known as the "control agenda" in AI safety. While alignment research focuses on making AI systems intrinsically safe by aligning their goals with human values, the control approach develops external safeguards that work even if misalignment occurs. Control serves as an exciting complementary research direction to alignment, offering additional layers of protection as AI systems become more powerful and potentially autonomous.

The paper can be found here: https://arxiv.org/abs/2312.06942

In a paper club, participants read the paper beforehand, and during the session a facilitator leads a discussion of the concepts it covers.

This is a prequel event to the AI Control Hackathon:

AI Control Hackathon 2025
Fri, Mar 28, 7:00 PM GMT+8
Location: 22 Cross St, Singapore 048421