Presented by Trajectory Labs

AI Safety Thursdays: When Good Rewards Go Bad - Reward Overoptimization in RLHF

About Event

Today's Topic

Reinforcement learning from human feedback (RLHF) has become a popular way to align AI behavior with human preferences. But what happens when the system gets too good at optimizing the reward signal?

Evgenii Opryshko will walk us through how overoptimization can lead to unintended behaviors, why it happens, and what we can do about it. We'll look at examples, discuss open challenges, and consider what this means for aligning advanced AI systems.
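As a rough intuition for the failure mode, here is a minimal toy sketch in Python (not from the presentation): a policy is tuned by gradient ascent against an imperfect proxy reward, and past a point the proxy keeps improving while the true objective degrades. The functions, constants, and the scalar "policy knob" theta are illustrative assumptions for this sketch only.

```python
# Toy sketch of reward overoptimization (Goodhart's law). Assumes a single
# scalar "policy knob" theta, a true reward we care about, and an imperfect
# learned proxy (reward model). All values here are illustrative assumptions.

def true_reward(theta: float) -> float:
    # What humans actually want: best at theta = 1, worse beyond it.
    return -(theta - 1.0) ** 2

def proxy_reward(theta: float) -> float:
    # Reward model: accurate near the training distribution (small theta),
    # but systematically over-rewards large theta out of distribution.
    return -(theta - 1.0) ** 2 + 0.6 * theta ** 2

theta = 0.0
for step in range(51):
    # Gradient ascent on the *proxy*, using a finite-difference gradient.
    eps = 1e-4
    grad = (proxy_reward(theta + eps) - proxy_reward(theta - eps)) / (2 * eps)
    theta += 0.05 * grad
    if step % 10 == 0:
        print(f"step {step:2d}  proxy={proxy_reward(theta):6.2f}  "
              f"true={true_reward(theta):6.2f}")

# Typical pattern: the proxy keeps climbing while the true reward peaks
# (around theta ~= 1) and then gets worse -- optimizing harder hurts.
```

Running this shows both rewards improving at first, then diverging: more optimization pressure on the proxy yields steadily worse true performance, which is the core pattern the talk examines in the RLHF setting.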

Event Schedule
6:00 PM to 6:45 PM - Networking and refreshments
6:45 PM to 8:00 PM - Main presentation
8:00 PM to 9:00 PM - Breakout discussions

Location
30 Adelaide St E
Toronto, ON M5C 3G8, Canada
Enter the main lobby of the building and let the security staff know you are here for the AI meetup. You may need to show your RSVP on your phone. You will be directed to the 12th floor where the meetup is held. If you have trouble getting in, give Smitty a call at 647-424-4111.