

Lossfunk talks || Tokenize Everything (Images Edition) || Prashant Shishodia
How can we build a single AI model that natively understands and generates all modalities - images, audio, and video? I argue that discrete tokenization can make the problem much simpler. Focusing on images, we discuss whether images can be tokenized into discrete tokens, and whether those tokens are useful for both understanding and generation. We'll review the evolution from pixel/waveform-based models to the now-dominant two-stage latent generation approach. Key techniques such as VQ-VAE (discrete) and KL-regularized VAEs (continuous) will be compared, highlighting their impact on model efficiency and capabilities. We will critically examine the open challenges: balancing compression, perceptual quality, and modelability; controlling latent capacity; and the inflexibility of grid structures.
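To give a flavor of what "discrete tokenization" means here, below is a minimal NumPy sketch (not taken from the talk; all sizes are illustrative assumptions) of the VQ-VAE quantization step: each continuous latent vector from the encoder is replaced by the index of its nearest codebook entry, turning an image into a short sequence of discrete tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a learned codebook of K entries, each of dimension D.
K, D = 512, 64
codebook = rng.normal(size=(K, D))

# Stand-in for an encoder output: an 8x8 grid of D-dim latent vectors,
# flattened to 64 vectors (one per spatial position).
latents = rng.normal(size=(8 * 8, D))

# Vector quantization: map each latent to the index of its nearest
# codebook entry under squared Euclidean distance.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)  # 64 discrete tokens, each in [0, K)

# Generation/decoding would start from the quantized vectors,
# recovered by a simple codebook lookup.
quantized = codebook[tokens]
print(tokens.shape, quantized.shape)
```

These token indices are what a single sequence model could then consume (for understanding) or predict autoregressively (for generation), alongside text tokens.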