Presented by

Catalyzing Toronto's role in steering AI progress toward a future of human flourishing. Join us for a variety of events on technical AI safety, governance in a world of advanced AI, and more.

AI Safety Thursdays: Tracing the Thoughts of a Large Language Model

Name: AI Safety Thursdays: Tracing the Thoughts of a Large Language Model
Start: 2025-06-05T18:00:00.000-04:00
End: 2025-06-05T21:00:00.000-04:00
Location: 30 Adelaide St E

Trajectory Labs

30 Adelaide St E

Toronto, Ontario

Past Event

About Event

Description

How do large language models actually work on the inside? Annie Sorkin presents on new research from Anthropic's Transformer Circuits team that opens up the "black box" of Claude 3.5 Haiku, revealing the computational mechanisms behind everything from multi-step reasoning to poetry planning.

Using a new methodology called attribution graphs, we'll explore how models handle multiple languages, why they exhibit concerning behaviors such as giving in to jailbreaks, and how they sometimes engage in unfaithful reasoning.

Event Schedule

6:00 to 6:45 - Networking and refreshments

6:45 to 8:00 - Main Presentation

8:00 to 9:00 - Breakout Discussions
