
Hackathon: Alignment Faking Model Organisms

Hosted by Annie Szorkin & LISA
About Event

Important registration information: to participate in this event, please join via the Discord link before registering.

This event is open to LISA members only.

Many safety and governance measures rely on AI models showing us their true colours. "Alignment faking" is the phenomenon of a model hiding misaligned behaviour when it believes it's being observed.

In this hackathon, we will be constructing model organisms of alignment faking: realistic, experimentally verified pathways under which alignment faking can occur. We'll be test-driving a new framework for alignment-faking experiments. The environment, monitoring, and scoring are already set up; all we need to do is supply the models! These can be fine-tunes of open-source models or simple prompt engineering.

We will be at LISA, so please make sure to read and comply with the Events Code of Conduct.

Bring a laptop. Beefy GPUs are not necessary; we'll provide credits for API-based fine-tuning of open-source models, so you won't need to run them locally.

More information is available on the Alignment Faking Hackathons Notion page.

Location
London Initiative for Safe AI (LISA)