Presented by
CognitionTO
ML/AI paper reading group

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

About Event

ML Paper Reading Group in Toronto.

Join on discord: https://discord.gg/zt46KQnc
Join on whatsapp: https://chat.whatsapp.com/LNbhjDgPZb8EzO5075s5op

Paper: https://arxiv.org/abs/2408.02946

Abstract:

LLMs produce harmful and undesirable behavior when trained on poisoned datasets that contain a small fraction of corrupted or harmful data. We develop a new attack paradigm, jailbreak-tuning, that combines data poisoning with jailbreaking to fully bypass state-of-the-art safeguards and make models like GPT-4o comply with nearly any harmful request. Our experiments suggest this attack represents a paradigm shift in vulnerability elicitation, producing differences in refusal rates of as much as 60+ percentage points compared to normal fine-tuning. Given this demonstration of how data poisoning vulnerabilities persist and can be amplified, we investigate whether these risks will likely increase as models scale. We evaluate three threat models - malicious fine-tuning, imperfect data curation, and intentional data contamination - across 24 frontier LLMs ranging from 1.5 to 72 billion parameters. Our experiments reveal that larger LLMs are significantly more susceptible to data poisoning, learning harmful behaviors from even minimal exposure to harmful data more quickly than smaller models. These findings underscore the need for leading AI companies to thoroughly red team fine-tuning APIs before public release and to develop more robust safeguards against data poisoning, particularly as models continue to scale in size and capability.
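For discussion at the meetup, here is a minimal illustrative sketch (not code from the paper) of what the "small fraction of corrupted data" threat model looks like at the dataset level: a fine-tuning corpus where a poison_rate fraction of examples is replaced by attacker-controlled ones, as in the imperfect-curation and intentional-contamination threat models the abstract lists. All names (make_poisoned_dataset, the record fields) are hypothetical placeholders, and the "poison" records are inert stand-ins.

```python
# Illustrative sketch only -- not the paper's code or data.
import random

def make_poisoned_dataset(clean_examples, poison_examples, poison_rate, seed=0):
    """Mix a small fraction of attacker-controlled examples into a clean
    fine-tuning set, keeping the overall dataset size unchanged."""
    rng = random.Random(seed)
    n_poison = int(len(clean_examples) * poison_rate)
    kept_clean = rng.sample(clean_examples, len(clean_examples) - n_poison)
    injected = rng.choices(poison_examples, k=n_poison)
    mixed = kept_clean + injected
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    clean = [{"prompt": f"benign prompt {i}", "completion": "benign answer"}
             for i in range(1000)]
    poison = [{"prompt": "attacker prompt", "completion": "attacker-chosen answer"}]
    # A 2% poison rate is just an example value; the paper's scaling question
    # is how little of this contamination larger models need to learn from.
    dataset = make_poisoned_dataset(clean, poison, poison_rate=0.02)
    n_bad = sum(ex["completion"] == "attacker-chosen answer" for ex in dataset)
    print(n_bad, "poisoned examples out of", len(dataset))
```

The paper's jailbreak-tuning attack goes further than this plain mixing, pairing the poisoned fine-tuning data with jailbreak-style prompting; the sketch above only shows the contamination step the scaling experiments vary.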

Location
30 Adelaide St E
Toronto, ON M5C 3G8, Canada