Cover Image for Hackathon: Alignment Faking
Cover Image for Hackathon: Alignment Faking
Avatar for Trajectory Labs
Presented by
Trajectory Labs
18 Went
Registration
Past Event
Welcome! To join the event, please register below.
About Event

Important registration information: ​​To participate in this event, please join the discord link before registering.

Many safety and governance measures rely on AI models showing us their true colours. "Alignment faking" is the phenomenon of a model hiding misaligned behaviour when it believes it's being observed.

In this hackathon, we will be constructing model organisms of alignment faking: realistic, experimentally-verified pathways under which alignment faking can occur. We'll be test-driving a new framework for alignment faking experiments. The environment, monitoring and scoring are already set up - all we need to do is supply the models! These can be fine-tunes of open source models or simple prompt engineering.

​​Trajectory Labs, the jamsite, provides a comfortable and spacious coworking space along with coffee, tea, and other refreshments (meals not provided, but there are many nearby options). Other locations will also be taking part!

Bring a laptop (beefy GPUs are not necessary, we'll provide credits for API-based finetuning of open source models so you don't need to run them locally).

More information on the Alignment Faking Hackathons notion page

Location
30 Adelaide St E 12th floor
Toronto, ON M5C, Canada
Call/text Giles (+1 (647) 823-4865) if you need help.
Avatar for Trajectory Labs
Presented by
Trajectory Labs
18 Went