Presented by

Catalyzing Toronto's role in steering AI progress toward a future of human flourishing. Join us for a variety of events on technical AI safety, governance in a world of advanced AI, and more.

Hackathon: Alignment Faking

Name: Hackathon: Alignment Faking
Start: 2025-08-16T10:00:00.000-04:00
End: 2025-08-17T18:00:00.000-04:00
Location: 30 Adelaide St E 12th floor

Trajectory Labs

Toronto, Ontario

About Event

Important registration information: To participate in this event, please join the Discord server before registering.

Many safety and governance measures rely on AI models showing us their true colours. "Alignment faking" is the phenomenon of a model hiding misaligned behaviour when it believes it's being observed.

In this hackathon, we will construct model organisms of alignment faking: realistic, experimentally verified pathways under which alignment faking can occur. We'll be test-driving a new framework for alignment faking experiments. The environment, monitoring, and scoring are already set up; all we need to do is supply the models! These can be fine-tunes of open-source models or the product of simple prompt engineering.

Trajectory Labs, the jamsite, provides a comfortable and spacious coworking space along with coffee, tea, and other refreshments (meals not provided, but there are many nearby options). Other locations will also be taking part!

Bring a laptop. Beefy GPUs are not necessary: we'll provide credits for API-based fine-tuning of open-source models, so you don't need to run them locally.

More information is available on the Alignment Faking Hackathons Notion page.
