Collov@CVPR'24: Long-form Video Understanding Workshop
Long-form Video Understanding Towards Multimodal AI Assistant and Copilot Workshop @ CVPR'24
Date: June 17, 2024 (the schedule is subject to change, and all registered attendees will be notified promptly of any adjustments)
Location: 2024 Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Industry Host: Collov AI, a leader in AI-powered interior design (website: collov.ai; Twitter: @collov_ai)
Collov AI invites the academic community to join our workshop at CVPR'24, organized by researchers from NUS, UTS, UC Berkeley, FAIR, ZJU, UT Austin, Meta AI, Dartmouth, NTU, Shanghai AI Lab, and UW. The workshop will explore advancements and emerging challenges in the field of long-form video understanding. This year's workshop features focused tracks on key areas of research and development:
Track 1: Long-Term Video Question Answering
This track addresses the challenges of interpreting and answering questions about content from extended video footage. Research presentations and discussions will focus on developing algorithms that better capture the context and details within long video sequences. Participants will examine case studies and current research that use AI to parse and respond to nuanced inquiries over lengthy durations, enhancing automated systems' capabilities in media analysis and interaction.
Track 2A: Text-Guided Video Editing
Track 2A provides an in-depth look at how textual metadata and scripts can dynamically influence video editing processes. The track will cover various approaches, including the use of AI to interpret text cues and apply them for video segmentation, scene recognition, and content-appropriate editing techniques. It aims to bridge the gap between traditional video editing and automated, text-driven processes, fostering a discussion on the integration of advanced language models with video editing tools.
Track 2B: Text-to-Video Generation
βIn Track 2B, participants will explore the field of generating video content directly from text descriptions. This track will include comprehensive overviews of the current state-of-the-art technologies, practical applications, and the theoretical foundations of text-to-video synthesis. Discussions will focus on the challenges of ensuring fidelity and coherence in generated videos, as well as potential applications in entertainment, education, and content creation.
Speakers:
Dima Damen (University of Bristol / Google DeepMind)
Marc Pollefeys (ETH Zurich / Microsoft)
Chunyuan Li (ByteDance / TikTok)