Presented by Trajectory Labs

AI Safety Thursdays: When Good Rewards Go Bad - Reward Overoptimization in RLHF

About Event

Today's Topic

Reinforcement learning from human feedback (RLHF) has become a popular way to align AI behavior with human preferences. But what happens when the system gets too good at optimizing the reward signal?

Evgenii Opryshko will walk us through how overoptimization can lead to unintended behaviors, why it happens, and what we can do about it. We'll look at examples, discuss open challenges, and consider what this means for aligning advanced AI systems.
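As a rough intuition for the failure mode, here is a minimal toy sketch in Python (not from the presentation): a policy is tuned by gradient ascent against an imperfect proxy reward, and past a point the proxy keeps improving while the true objective degrades. The functions, constants, and the scalar "policy knob" theta are illustrative assumptions for this sketch only.

```python
# Toy sketch of reward overoptimization (Goodhart's law). Assumes a single
# scalar "policy knob" theta, a true reward we care about, and an imperfect
# learned proxy (reward model). All values here are illustrative assumptions.

def true_reward(theta: float) -> float:
    # What humans actually want: best at theta = 1, worse beyond it.
    return -(theta - 1.0) ** 2

def proxy_reward(theta: float) -> float:
    # Reward model: accurate near the training distribution (small theta),
    # but systematically over-rewards large theta out of distribution.
    return -(theta - 1.0) ** 2 + 0.6 * theta ** 2

theta = 0.0
for step in range(51):
    # Gradient ascent on the *proxy*, using a finite-difference gradient.
    eps = 1e-4
    grad = (proxy_reward(theta + eps) - proxy_reward(theta - eps)) / (2 * eps)
    theta += 0.05 * grad
    if step % 10 == 0:
        print(f"step {step:2d}  proxy={proxy_reward(theta):6.2f}  "
              f"true={true_reward(theta):6.2f}")

# Typical pattern: the proxy keeps climbing while the true reward peaks
# (around theta ~= 1) and then gets worse -- optimizing harder hurts.
```

Running this shows both rewards improving at first, then diverging: more optimization pressure on the proxy yields steadily worse true performance, which is the core pattern the talk examines in the RLHF setting.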

Event Schedule
6:00 PM to 6:45 PM - Networking and refreshments
6:45 PM to 8:00 PM - Main presentation
8:00 PM to 9:00 PM - Breakout discussions

Location
30 Adelaide St E
Toronto, ON M5C 3G8, Canada
Enter the main lobby of the building and let the security staff know you are here for the AI meetup. You may need to show your RSVP on your phone. You will be directed to the 12th floor where the meetup is held. If you have trouble getting in, give Smitty a call at 647-424-4111.