Compact Proofs of Model Performance via Mechanistic Interpretability
Louis Jaburi – Independent researcher
Generating proofs about neural network behavior is a fundamental challenge, as the internal structure of these networks is highly complex. With recent progress in mechanistic interpretability, we now have better tools for understanding neural networks. In this talk, I will present a novel approach that leverages interpretations to construct rigorous (and compact!) proofs about model behavior, based on recent work [1][2]. I will explain how understanding a model's internal mechanisms can enable stronger mathematical guarantees about its behavior, and discuss how these approaches connect to the broader guaranteed safe AI framework. Drawing on practical experience, I will share key challenges encountered and outline directions for scaling formal verification to increasingly complex neural networks.
[1] Compact Proofs of Model Performance via Mechanistic Interpretability (https://arxiv.org/abs/2406.11779)
[2] Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations (https://arxiv.org/abs/2410.07476)
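To give a flavor of the core idea in [1], here is a minimal sketch in Python. It is not the paper's transformer setup: the toy `model`, `target`, integer domain, and threshold are hypothetical stand-ins. The point it illustrates is the contrast between a brute-force certificate (evaluate the model on every input) and a compact one (a counting argument licensed by a mechanistic reading of the model).

```python
import itertools

N = 8
DOMAIN = list(itertools.product(range(N), range(N)))  # 64 inputs

def target(x):
    # Ground truth: is the first coordinate at least the second?
    return x[0] >= x[1]

def model(x):
    # Toy "model": a slightly miscalibrated threshold on the same feature.
    return x[0] - x[1] >= -1

# --- Proof 1: brute force. Exact accuracy, but the certificate is the
# full evaluation of the model on all |DOMAIN| inputs.
exact_acc = sum(model(x) == target(x) for x in DOMAIN) / len(DOMAIN)

# --- Proof 2: compact. The mechanistic reading of `model` is "threshold
# the feature d = x0 - x1 at -1", while the target thresholds d at 0.
# Hence the two disagree exactly on {x : x0 - x1 == -1}, a set we can
# count analytically (N - 1 points) without evaluating the model at all.
err_bound = (N - 1) / (N * N)
compact_acc_bound = 1 - err_bound

assert exact_acc >= compact_acc_bound  # here the toy bound is tight
print(f"exact accuracy      = {exact_acc:.4f}")
print(f"compact lower bound = {compact_acc_bound:.4f}")
```

In the paper, an analogous tradeoff appears at scale: more mechanistic, shorter proofs buy compactness, while the tightness of the resulting bound depends on how faithful the interpretation is. In this toy case the interpretation is exact, so the compact bound happens to match the brute-force accuracy.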
GS AI seminars
The monthly seminar series on Guaranteed Safe AI brings together researchers to advance the field of building AI with high-assurance quantitative safety guarantees.