Video Understanding: From Tags to Language
- Chuang Gan | Tsinghua University
The increasing ubiquity of devices capable of capturing video has led to an explosion in the amount of recorded video content. Rather than "eyeballing" videos for potentially useful information, there is a pressing need for automatic video analysis and understanding algorithms across many applications. Understanding videos at scale remains challenging, however, due to the large variations and complexities of video content, the time-consuming nature of annotation, and the wide range of video concepts involved. In light of these challenges, my research on video understanding focuses on designing effective network architectures to learn robust video representations, learning video concepts from weak supervision, and building a stronger connection between language and vision. In this talk, I will first introduce a Deep Event Network (DevNet) that can simultaneously detect pre-defined events and localize spatio-temporal key evidence. Then I will show how web-crawled videos and images can be utilized for learning video concepts. Finally, I will present our recent efforts to connect visual understanding to language through attractive visual captioning and visual question answering.
Gang Hua
Principal Researcher/Research Manager