Presented by
CognitionTO
ML/AI paper reading group

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

About Event

ML Paper Reading Group in Toronto.

Join on discord: https://discord.gg/zt46KQnc
Join on whatsapp: https://chat.whatsapp.com/LNbhjDgPZb8EzO5075s5op

Paper: https://arxiv.org/abs/2408.02946

Abstract:

LLMs produce harmful and undesirable behavior when trained on poisoned datasets that contain a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates of as much as 60+ percentage points compared to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks will likely increase as models scale. We evaluate three threat models - malicious fine-tuning, imperfect data curation, and intentional data contamination - across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.
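For discussion at the meetup, here is a minimal illustrative sketch (not code from the paper) of what the "small fraction of corrupted data" threat model looks like at the dataset level: a fine-tuning corpus where a poison_rate fraction of examples is replaced by attacker-controlled ones, as in the imperfect-curation and intentional-contamination threat models the abstract lists. All names (make_poisoned_dataset, the record fields) are hypothetical placeholders, and the "poison" records are inert stand-ins.

```python
# Illustrative sketch only -- not the paper's code or data.
import random

def make_poisoned_dataset(clean_examples, poison_examples, poison_rate, seed=0):
    """Mix a small fraction of attacker-controlled examples into a clean
    fine-tuning set, keeping the overall dataset size unchanged."""
    rng = random.Random(seed)
    n_poison = int(len(clean_examples) * poison_rate)
    kept_clean = rng.sample(clean_examples, len(clean_examples) - n_poison)
    injected = rng.choices(poison_examples, k=n_poison)
    mixed = kept_clean + injected
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    clean = [{"prompt": f"benign prompt {i}", "completion": "benign answer"}
             for i in range(1000)]
    poison = [{"prompt": "attacker prompt", "completion": "attacker-chosen answer"}]
    # A 2% poison rate is just an example value; the paper's scaling question
    # is how little of this contamination larger models need to learn from.
    dataset = make_poisoned_dataset(clean, poison, poison_rate=0.02)
    n_bad = sum(ex["completion"] == "attacker-chosen answer" for ex in dataset)
    print(n_bad, "poisoned examples out of", len(dataset))
```

The paper's jailbreak-tuning attack goes further than this plain mixing, pairing the poisoned fine-tuning data with jailbreak-style prompting; the sketch above only shows the contamination step the scaling experiments vary.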

Location
30 Adelaide St E
Toronto, ON M5C 3G8, Canada