
Paper Club: Safety Alignment Should Be Made More Than Just A Few Tokens Deep

Past Event
About Event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Join us as we dive into the ICLR 2025 paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep," which won an Outstanding Paper Award at the conference. The paper critically examines why current Large Language Models (LLMs) remain vulnerable despite extensive safety alignment training. The researchers pinpoint a fundamental shortcoming: alignment often affects only the first few tokens of a model's output, a phenomenon they term "shallow safety alignment." Through detailed case studies, the paper demonstrates how this issue underpins numerous vulnerabilities, including adversarial suffix attacks, prefilling attacks, and fine-tuning attacks. To address them, the authors propose countermeasures such as training on augmented data that deepens safety alignment beyond the initial tokens and using a constrained fine-tuning objective that protects the initial token distribution. Join us to explore these insights and discuss future directions for building more robustly aligned language models.
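
For those who want a concrete feel for the second defence before the session, below is a minimal PyTorch sketch of what a token-wise constrained fine-tuning objective can look like: standard cross-entropy plus a per-token KL penalty toward a frozen copy of the aligned model, weighted more heavily on the earliest response tokens. This is our own illustration, not the authors' code; the function name and the weighting schedule (n_protected, beta_early, beta_late) are assumptions chosen for readability.

```python
# Illustrative sketch (not the paper's implementation) of a token-wise
# constrained fine-tuning loss: cross-entropy on the fine-tuning data plus a
# per-token KL penalty toward a frozen reference (aligned) model, with a
# heavier weight on the first few response tokens so their distribution is
# protected during fine-tuning. The schedule values below are placeholders.
import torch
import torch.nn.functional as F

def constrained_finetune_loss(policy_logits: torch.Tensor,
                              ref_logits: torch.Tensor,
                              labels: torch.Tensor,
                              n_protected: int = 5,
                              beta_early: float = 1.0,
                              beta_late: float = 0.1) -> torch.Tensor:
    """policy_logits, ref_logits: [T, V] logits over the response tokens;
    labels: [T] target token ids; ref_logits come from the frozen aligned model."""
    ce = F.cross_entropy(policy_logits, labels, reduction="none")       # [T] standard SFT term
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_r = F.log_softmax(ref_logits.detach(), dim=-1)
    kl = (log_p.exp() * (log_p - log_r)).sum(dim=-1)                    # [T] KL(policy || reference) per token
    beta = torch.full_like(ce, beta_late)
    beta[:n_protected] = beta_early                                     # constrain early tokens more strongly
    return (ce + beta * kl).mean()
```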

Paper Link: https://openreview.net/pdf?id=6Mxhg9PtDE

Location
22 Cross St
Singapore 048421