
Transformers need glasses! Information over-squashing in language tasks

About Event

The Data Phoenix team invites you to our upcoming webinar, which will take place on August 22 at 10 a.m. PT.

  • Topic: Transformers need glasses! Information over-squashing in language tasks

  • Speaker: Petar Veličković (Staff Research Scientist, Google DeepMind; Affiliated Lecturer, University of Cambridge)

  • Participation: free (but you’ll be required to register)

We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.

Highlight / tl;dr:

Modern LLM systems rely on decoder-only Transformers and causal masks to achieve success at scale, but this architectural choice brings with it several fundamental bottlenecks, which make such systems vulnerable to underperforming at various essential tasks, such as copying and counting. We'll explore why this issue happens, and some simple strategies for mitigating it that you can apply in practice.
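To make the low-precision point concrete, here is a minimal numerical sketch (our illustration, not the paper's exact construction). It assumes the last-token representation behaves roughly like an average over the prefix tokens, as near-uniform causal attention would produce, and shows that two distinct counting-style inputs can round to the same float16 value, so the model cannot tell them apart:

```python
import numpy as np

# Illustrative sketch only: treat the final-token representation as an average
# over the prefix (as near-uniform causal attention would produce), and compare
# "1 repeated n times, then 0" with "1 repeated n+1 times, then 0".
# Their exact averages are n/(n+1) and (n+1)/(n+2), which differ -- but in
# float16 they round to the same number once n is large enough.

n = 4096
mean_a = np.float16(n / (n + 1))        # stands in for the input 1^n 0
mean_b = np.float16((n + 1) / (n + 2))  # stands in for the distinct input 1^(n+1) 0

print(mean_a, mean_b, mean_a == mean_b)  # expected: the two values are equal in float16
```

Any downstream computation that only sees these collapsed representations is forced to give the same answer for both inputs, which is why counting- and copying-style tasks are affected.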

Speaker

Petar Veličković is a Staff Research Scientist at Google DeepMind and an Affiliated Lecturer at the University of Cambridge, where he is also an Associate of Clare Hall. He holds a PhD in Computer Science from the University of Cambridge (Trinity College), obtained under the supervision of Pietro Liò. His research concerns geometric deep learning: devising neural network architectures that respect the invariances and symmetries in data. For his contributions, he is recognised as an ELLIS Scholar in the Geometric Deep Learning Program. In particular, he focuses on graph representation learning and its applications in algorithmic reasoning. Petar is the first author of Graph Attention Networks (a popular convolutional layer for graphs) and Deep Graph Infomax (a popular self-supervised learning pipeline for graphs). His research has been used to substantially improve travel-time predictions in Google Maps and to guide the intuition of mathematicians towards new top-tier theorems and conjectures.

Please join the Data Phoenix Discord and follow us on LinkedIn and YouTube to stay updated on our community events and the latest AI and data news.

Location
https://events.dataphoenix.info/live