Analyzing a Malicious AI Agent
This is a free AI/AI safety workshop open to anyone.
It is split into two parts: the first half is a non-technical introduction to AI and AI safety via a series of forecasting exercises. The second half gives coders a hands-on example of a toy malicious AI agent and asks them to analyze what might be causing its behavior.
Even if you do not have any coding background, please feel free to attend. The first half is explicitly non-coding, and we will have AI-governance-related materials prepared in case a large number of people without a coding background show up for the second half.
For the forecasting exercises (the first 2 hours):
We’ll be kicking off with a gentle introduction to the state of modern AI today before moving into more detailed forecasting exercises.
In particular, we will cover:
An exploration of what modern AI models currently are and are not capable of
Understanding the landscape of risk and dangers associated with current and future AI models
Using prediction markets to calibrate expectations and help forecast what kinds of AI developments might come and when
For the malicious agent part (the second 2 hours):
We’ll be spending the time dissecting an agent that’s been trained to navigate a maze and optionally harvest crops and/or humans along the way. During training, the agent very sensibly harvests crops and avoids humans. But as soon as we deploy the agent, it goes out and starts harvesting humans!
Why might that happen? That’s the riddle we’ll be exploring!
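To give a flavor of the kind of failure we’ll be hunting for: one classic hypothesis in cases like this is that the agent learned a proxy feature that happened to correlate with the intended goal during training, and the correlation breaks at deployment. The sketch below is purely illustrative and is not the workshop’s actual agent; the object features, colors, and the tiny "training" procedure are all made up for the example.

```python
from collections import Counter

def train_policy(episodes):
    # Toy "learner": it notices which color co-occurred with rewarded
    # harvests and adopts the rule "harvest anything of that color".
    # It never learns to distinguish crops from humans directly.
    rewarded = Counter(obj["color"] for obj in episodes if obj["reward"] > 0)
    target_color = rewarded.most_common(1)[0][0]
    return lambda obj: obj["color"] == target_color

# Hypothetical training data: crops are always yellow, humans always blue,
# so "is yellow" and "is a crop" are indistinguishable to the learner.
train = [
    {"kind": "crop",  "color": "yellow", "reward": 1},
    {"kind": "crop",  "color": "yellow", "reward": 1},
    {"kind": "human", "color": "blue",   "reward": 0},
    {"kind": "human", "color": "blue",   "reward": 0},
]
policy = train_policy(train)

# Deployment: the spurious correlation breaks.
print(policy({"kind": "human", "color": "yellow"}))  # True: it "harvests" the human
print(policy({"kind": "crop",  "color": "blue"}))    # False: it skips the crop
```

Whether something like this (as opposed to, say, a reward misspecification) is what’s going on in the workshop’s agent is exactly the kind of question the hands-on session digs into.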
During the session you’ll be doing some hands-on code exploration and spelunking with an AI model trained via reinforcement learning to play this game. While we don’t expect attendees to write much code from scratch (although some exploratory code may be written as you play with the models), we do expect attendees to be able to comfortably read Python code. Along the way we’ll spend a little time introducing people to the basics of reinforcement learning and its role in modern AI.
We will assume that participants are comfortable with Python and have constructed and trained a basic neural net before with some sort of autograd library such as PyTorch. If you are familiar with Python but have not previously trained a basic neural net, please reach out to us at info@aisafetyawarenessfoundation.org, and we’d be happy to see whether there are some introductory materials you could review beforehand.