

TAI AHR #09 - Foundation Models for Robotics
This upcoming event will explore the future of Vision-Language-Action (VLA) models in robotics, covering their foundational concepts, real-world deployment challenges, and the need for foundation models in mobile robots. Discussions will highlight how these models enable generalist agents that can see, understand, and act, and how they may reshape embodied AI across industries and daily life.
Agenda
18:00 Doors open
18:30 - 19:00 Core Ideas of Vision-Language-Action (VLA) Models (Émilie Fabre)
19:00 - 19:30 Challenges of Vision-Language-Action (VLA) Model Inference (Maxime Alvarez)
19:30 - 20:00 Why do we need foundation models for mobile robots (and why Japan is the best place to create them)? (Ibrahim Orhan)
20:00 - 21:00 Networking
21:00 Doors close
Speakers:
Talk 1 - Core Ideas of Vision-Language-Action (VLA) Models
Speaker: Émilie Fabre (Research Assistant, University of Tokyo)
Abstract: Robotics has always made use of cutting-edge AI to power fast, reliable policies in industrial settings. Today, advances in multimodal AI are opening the door to a new generation of agents that can see, understand, and act, blending vision, language, and action into a new type of model: Vision-Language-Action (VLA) models. This talk introduces the core ideas behind VLA models, exploring how they work, how they are trained, and what makes them uniquely powerful. We'll look at current research directions, real-world capabilities, and what these generalist agents might mean for future applications, both in business and in everyday life. Finally, we'll consider how such models might reshape embodied interaction in XR and beyond.
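To make the core idea concrete ahead of the talk, here is a deliberately toy Python sketch (not any published VLA architecture; the class names, feature summaries, and the 7-dimensional action space are illustrative assumptions): a single policy takes a camera image plus a language instruction and returns a low-level robot action.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Observation:
    image: np.ndarray   # (H, W, 3) RGB camera frame
    instruction: str    # natural-language task, e.g. "pick up the red cup"

class ToyVLAPolicy:
    """Stand-in for a pretrained vision-language backbone with an action head."""

    ACTION_DIM = 7  # assumption: 6-DoF end-effector delta + gripper command

    def encode(self, obs: Observation) -> np.ndarray:
        # Real VLAs encode the image (vision encoder) and the instruction
        # (language model) into a shared embedding; here we fake both with
        # trivial summaries just to show the data flow.
        visual = obs.image.astype(float).mean(axis=(0, 1)) / 255.0    # 3 values
        text = np.array([len(obs.instruction)], dtype=float) / 100.0  # 1 value
        return np.concatenate([visual, text])

    def act(self, obs: Observation) -> np.ndarray:
        # The action head decodes the fused features into a low-level command.
        features = self.encode(obs)
        padded = np.pad(features, (0, self.ACTION_DIM - features.size))
        return np.tanh(padded)

obs = Observation(image=np.zeros((224, 224, 3), dtype=np.uint8),
                  instruction="pick up the red cup")
print(ToyVLAPolicy().act(obs))  # 7-dim action a robot controller would execute
```

The point of the sketch is only the interface: one model, one forward pass, from pixels and text to an executable action.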
Bio: Émilie is a PhD candidate at the University of Tokyo, advised by Jun Rekimoto and Yuta Itoh. Her main work focuses on enabling virtual entities in XR to better interact and communicate with our reality. Her strong technical background has led her to bridge XR, robotics, and AI with the goal of creating corporeal virtual agents for general-purpose applications. With an emphasis on HCI and HRI, she strives to make interactions with virtual agents feel intuitive, expressive, and grounded in the physical world.
Talk 2 - Challenges of Vision-Language-Action (VLA) Model Inference
Speaker: Maxime Alvarez (Robotics AI Engineer, Telexistence)
Abstract: Vision-Language-Action models offer a powerful framework for building generalist agents that can see, understand, and act. Yet running these models on real-world robots comes with unique challenges, especially around latency, model size, and real-time execution. We will explore practical strategies to bridge the gap between large VLA models and robotic deployment. We'll cover inference techniques like action chunking, streaming, and asynchronous execution, as well as architectural ideas such as dual-system designs. We will finish with a real-world use case from Telexistence, highlighting how some of these methods come together in practice.
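As a rough illustration of two of the techniques mentioned above, action chunking and asynchronous execution, here is a minimal Python sketch: a slow VLA inference loop produces short chunks of future actions in a background thread while a fast control loop consumes them. The chunk size, timings, and the predict_chunk stub are assumptions for illustration, not Telexistence's actual pipeline.

```python
import queue
import threading
import time

CHUNK_SIZE = 8          # actions predicted per (slow) model call
CONTROL_PERIOD = 0.02   # 50 Hz low-level control loop

def predict_chunk(step: int) -> list[float]:
    """Stand-in for a VLA forward pass (hundreds of ms on real hardware)."""
    time.sleep(0.15)
    return [step + i * 0.01 for i in range(CHUNK_SIZE)]

action_buffer: "queue.Queue[float]" = queue.Queue()

def inference_loop(n_chunks: int) -> None:
    # Runs asynchronously so slow model calls never block the robot.
    for step in range(n_chunks):
        for action in predict_chunk(step):
            action_buffer.put(action)

threading.Thread(target=inference_loop, args=(3,), daemon=True).start()

# Control loop: execute buffered actions at a fixed rate, falling back to
# holding the current pose if the buffer runs dry.
for _ in range(3 * CHUNK_SIZE):
    try:
        action = action_buffer.get(timeout=0.5)
        print(f"execute {action:.2f}")
    except queue.Empty:
        print("buffer empty: hold current pose")
    time.sleep(CONTROL_PERIOD)
```

The design choice this illustrates is decoupling: inference runs at its own (slow) rate and fills a buffer, while the controller keeps a steady cycle regardless of when the next chunk arrives.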
Bio: Maxime is a PhD student at the University of Tokyo in Matsuo Lab, advised by Yutaka Matsuo and working closely with Yusuke Iwasawa and Tatsuya Matsushima. His research explores vision-language-action models and robotic foundation models for general-purpose autonomy. He is also a Robotic Foundation Model Engineer at Telexistence. With a background in software engineering and applied mathematics, Maxime aims to bridge robotics and AI through scalable, multimodal learning.
Talk 3 - Why do we need foundation models for mobile robots (and why Japan is the best place to create them)?
Speaker: Ibrahim Orhan (Co-founder, Kanaria Tech)
Abstract: Why don’t we see robots everywhere, tackling the dirty, dangerous, and dull tasks we dread? Why can’t we send Autonomous Mobile Robots (AMRs) into chemical plants, nuclear sites, landfills, hospitals, or busy city streets to make deliveries? Wheels have been our best way to move heavy loads for millennia, and the hardware for strong mobile robots has existed for over a decade. Yet software and intelligence have held us back, until now. With foundation models designed for AMRs, we finally have a chance to set robots free. We’ve been using AI to replace art so that we can wash more dishes? Instead, let’s send robots into harm’s way and give ourselves space to create.
Bio: Ibrahim is a scientist-turned-founder who began his journey in synthetic biology, engineering bacteria for biological cooling systems, smell detection, and resonance-induced protein synthesis. This led him into protein engineering and bioinformatics, his first deep dive into machine learning eight years ago. He later joined an AI startup as a core member and conducted research on robotics and action recognition models at Ritsumeikan University. Today, he is the co-founder and CEO of Kanaria Tech, where his team is tackling one of the toughest problems in AI: building embodied intelligence to give robots human-level social awareness and the ability to learn on their own.
Tokyo AI (TAI) information
TAI is the biggest AI community in Japan, with 2,400+ members mainly based in Tokyo (engineers, researchers, investors, product managers, and corporate innovation managers).
Web: https://www.tokyoai.jp/