We will read the following papers on sparse autoencoders:

Toy Models of Superposition

https://transformer-circuits.pub/2022/toy_model/toy_model.pdf

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning 

https://transformer-circuits.pub/2023/monosemantic-features

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet 

https://transformer-circuits.pub/2024/scaling-monosemanticity/

We will then explore some sparse autoencoders in a notebook. We will make the notebook available on GitHub so you can follow along if you like.

Sparse Autoencoders (Papers + Code)