
Compact Proofs of Model Performance via Mechanistic Interpretability – Louis Jaburi

Zoom
Past Event
About Event

Compact Proofs of Model Performance via Mechanistic Interpretability

Louis Jaburi – Independent researcher

Generating proofs about neural network behavior is a fundamental challenge, as their internal structure is highly complex. With recent progress in mechanistic interpretability, we now have better tools for understanding neural networks. In this talk, I will present a novel approach that leverages interpretations to construct rigorous (and compact!) proofs about model behavior, based on recent work [1][2]. I will explain how understanding a model's internal mechanisms can enable stronger mathematical guarantees about its behavior, and discuss how these approaches connect to the broader guaranteed safe AI framework. Drawing on practical experience, I will share key challenges encountered and outline directions for scaling formal verification to increasingly complex neural networks.

[1] Compact Proofs of Model Performance via Mechanistic Interpretability (https://arxiv.org/abs/2406.11779)
[2] Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations (https://arxiv.org/abs/2410.07476)
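
For a flavor of the approach, here is a minimal Python sketch, not the actual setup from [1]: a toy attention-style model on sequences of K tokens that is claimed to always output the maximum token. The names (model, brute_force_proof, compact_proof) and the constants V, K, and c are hypothetical choices for illustration only. The brute-force proof enumerates all V**K inputs, while the compact proof uses a mechanistic interpretation (the head attends to the max token) to reduce verification to a single worst-case inequality.

import itertools
import math

import numpy as np

# Toy sketch (hypothetical, not the models from [1]): sequences of K tokens
# drawn from {0, ..., V-1}; the "model" should output the maximum token.
V, K, c = 8, 4, 3.0  # vocabulary size, sequence length, attention sharpness

def model(seq: np.ndarray) -> int:
    # Softmax attention over token values; the logit for class j is the
    # total attention weight landing on occurrences of token j.
    w = np.exp(c * seq.astype(float))
    w /= w.sum()
    logits = np.zeros(V)
    np.add.at(logits, seq, w)
    return int(logits.argmax())

def brute_force_proof() -> tuple[bool, int]:
    # Exact but exponential: one forward pass per input, V**K in total.
    checks = 0
    for seq in itertools.product(range(V), repeat=K):
        checks += 1
        if model(np.array(seq)) != max(seq):
            return False, checks
    return True, checks

def compact_proof() -> tuple[bool, int]:
    # Interpretation: the head attends to the max token. The worst case is
    # one copy of the max v against K-1 copies of the runner-up v-1; the
    # max wins whenever exp(c*v) > (K-1)*exp(c*(v-1)), i.e. c > ln(K-1).
    # One inequality certifies the claim in place of V**K evaluations.
    return c > math.log(K - 1), 1

print(brute_force_proof())  # (True, 4096)
print(compact_proof())      # (True, 1)

The trade-off mirrors the one studied in [1]: the more faithful the interpretation, the shorter the proof, while a looser interpretation forces checking more cases or accepting a weaker bound on performance.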

GS AI seminars

The monthly seminar series on Guaranteed Safe AI brings together researchers to advance the field of building AI with high-assurance quantitative safety guarantees.

Monthly seminars on topics related to Guaranteed Safe AI. https://www.horizonevents.info/guaranteedsafeaisem…