
Paper Club: Safety Alignment Should Be Made More Than Just A Few Tokens Deep

Past Event
About Event

Technical Note: This event is intended for participants with a technical background. We strongly encourage reading the paper ahead of time to fully engage with the discussion.

Join us as we dive into the ICLR 2025 paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep," which won an Outstanding Paper Award at the conference. The paper critically examines why current Large Language Models (LLMs) remain vulnerable despite extensive safety alignment training. The researchers pinpoint a fundamental shortcoming: alignment often affects only the first few tokens of a model's output, a phenomenon they term "shallow safety alignment." Through detailed case studies, the paper demonstrates how this issue underpins numerous vulnerabilities, including adversarial suffix attacks, prefilling attacks, and fine-tuning attacks. To address them, the authors propose countermeasures such as training on augmented data that deepens safety alignment beyond the initial tokens and using a constrained fine-tuning objective that protects the initial token distribution. Join us to explore these insights and discuss future directions for building more robustly aligned language models.
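
For those who want a concrete feel for the second defence before the session, below is a minimal PyTorch sketch of what a token-wise constrained fine-tuning objective can look like: standard cross-entropy plus a per-token KL penalty toward a frozen copy of the aligned model, weighted more heavily on the earliest response tokens. This is our own illustration, not the authors' code; the function name and the weighting schedule (n_protected, beta_early, beta_late) are assumptions chosen for readability.

```python
# Illustrative sketch (not the paper's implementation) of a token-wise
# constrained fine-tuning loss: cross-entropy on the fine-tuning data plus a
# per-token KL penalty toward a frozen reference (aligned) model, with a
# heavier weight on the first few response tokens so their distribution is
# protected during fine-tuning. The schedule values below are placeholders.
import torch
import torch.nn.functional as F

def constrained_finetune_loss(policy_logits: torch.Tensor,
                              ref_logits: torch.Tensor,
                              labels: torch.Tensor,
                              n_protected: int = 5,
                              beta_early: float = 1.0,
                              beta_late: float = 0.1) -> torch.Tensor:
    """policy_logits, ref_logits: [T, V] logits over the response tokens;
    labels: [T] target token ids; ref_logits come from the frozen aligned model."""
    ce = F.cross_entropy(policy_logits, labels, reduction="none")       # [T] standard SFT term
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_r = F.log_softmax(ref_logits.detach(), dim=-1)
    kl = (log_p.exp() * (log_p - log_r)).sum(dim=-1)                    # [T] KL(policy || reference) per token
    beta = torch.full_like(ce, beta_late)
    beta[:n_protected] = beta_early                                     # constrain early tokens more strongly
    return (ce + beta * kl).mean()
```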

Paper Link: https://openreview.net/pdf?id=6Mxhg9PtDE

Location
22 Cross St
Singapore 048421