
Hackathon: Alignment Faking Model Organisms

Hosted by Annie Szorkin & LISA
About Event

Important registration information: to participate in this event, please join via the Discord link before registering.

This event is open to LISA members only.

Many safety and governance measures rely on AI models showing us their true colours. "Alignment faking" is the phenomenon of a model hiding misaligned behaviour when it believes it's being observed.

In this hackathon, we will be constructing model organisms of alignment faking: realistic, experimentally verified pathways under which alignment faking can occur. We'll be test-driving a new framework for alignment-faking experiments. The environment, monitoring, and scoring are already set up; all we need to do is supply the models! These can be fine-tunes of open-source models or simple prompt engineering.

We will be at LISA, so please make sure to read and comply with the Events Code of Conduct.

Bring a laptop. Beefy GPUs are not necessary; we'll provide credits for API-based fine-tuning of open-source models, so you won't need to run them locally.

More information is available on the Alignment Faking Hackathons Notion page.

Location
London Initiative for Safe AI (LISA)