

Transformer Architecture from Scratch
In this talk, we will explore the groundbreaking paper "Attention Is All You Need" by Vaswani et al., which introduced the Transformer architecture—revolutionizing deep learning by replacing recurrence with self-attention. We will break down the key components of the Transformer, understand why self-attention enables parallelization and captures long-range dependencies, and discuss its impact across NLP and vision tasks. To gain a hands-on understanding, we will implement the Transformer Encoder-Decoder architecture from scratch using PyTorch.
Additionally, we will explore how multi-head attention is interpreted using Anthropic's "0-layer theory" and "1-layer theory", which provide insights into how attention heads behave in shallow models and early Transformer layers.
Agenda:
1. Understanding the Transformer Architecture
   - Key innovations in "Attention Is All You Need"
   - Comparison with previous sequence models (RNNs, LSTMs)
   - Role of self-attention and positional encoding
2. Breaking Down the Transformer Components
   - Multi-head self-attention: mechanism and purpose of multiple attention heads
   - Interpretation of multi-head attention through:
     - 0-layer theory: 0-layer Transformers approximate bigram statistics
     - 1-layer theory: how single-layer Transformers give rise to simple forms of in-context learning
   - Feedforward layers and layer normalisation
   - Positional encoding and residual connections (see the positional-encoding sketch after the agenda)
3. Implementing the Transformer from Scratch
   - Writing multi-head self-attention in PyTorch (see the attention sketch after the agenda)
   - Building the encoder and decoder blocks (see the encoder-block sketch after the agenda)
4. Discussion & Future Directions
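As a preview of the positional-encoding discussion in the agenda, here is a minimal sketch of the sinusoidal positional encoding from "Attention Is All You Need", assuming the standard formulation PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and an even d_model; the module and variable names are illustrative, not the talk's actual code.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds the fixed sinusoidal position signal from "Attention Is All You Need"."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                          # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                           # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                           # odd dimensions
        self.register_buffer("pe", pe)                                         # fixed, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]
```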
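For the hands-on part of the talk, the sketch below shows one way to write multi-head self-attention in PyTorch, assuming d_model is divisible by num_heads and using a fused Q/K/V projection; class and parameter names are illustrative and the implementation built in the session may differ in detail.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention over several heads, as in "Attention Is All You Need"."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        # Project and split into (batch, heads, seq_len, d_head)
        qkv = self.qkv_proj(x).reshape(batch, seq_len, 3, self.num_heads, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (batch, heads, seq, d_head)

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_head)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if mask is not None:                              # e.g. a causal mask in the decoder
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                    # (batch, heads, seq, d_head)

        # Concatenate the heads and apply the final linear projection
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)
```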
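To connect the remaining pieces from item 2 (feedforward layers, layer normalisation, residual connections), here is a minimal sketch of a single encoder block that reuses the MultiHeadSelfAttention module above; it follows the post-norm arrangement of the original paper, and the dropout rate is an illustrative default.

```python
class TransformerEncoderBlock(nn.Module):
    """One encoder block: self-attention and a position-wise FFN, each wrapped in a
    residual connection followed by layer normalisation (post-norm, as in the paper)."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads)
        self.ffn = nn.Sequential(                         # position-wise feedforward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        # Sub-layer 1: residual connection around multi-head self-attention
        x = self.norm1(x + self.dropout(self.self_attn(x, mask)))
        # Sub-layer 2: residual connection around the feedforward network
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```

For example, TransformerEncoderBlock(512, 8, 2048) applied to a (batch, seq_len, 512) tensor matches the base-model dimensions used in "Attention Is All You Need".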