Visually Grounded Language Understanding and Generation

  • Jiasen Lu | Georgia Tech

In this talk, I will present our latest work on comprehending and generating visually grounded language. First, we will discuss the challenging task of learning the visual grounding of language. I will introduce how to pretrain task-agnostic visiolinguistic representations that transfer to a variety of vision-and-language tasks. In the second part of the talk, I will describe our recent work on image captioning that produces natural language explicitly grounded in entities that object detectors find in the image. At the end of the talk, I will briefly discuss ongoing efforts on vision-and-language multi-task learning and on generating goal-driven visual dialog without dialog data.


Speaker Details

Jiasen Lu is a Ph.D. student in the School of Interactive Computing at Georgia Tech, advised by Prof. Devi Parikh. His research is in computer vision, focusing particularly on the intersection of vision and language, including tasks such as visual question answering (VQA), image captioning, and visual dialog. He has published at major computer vision (CVPR, ICCV, ECCV), machine learning (NeurIPS, ICLR), and robotics (CoRL) conferences, and is a co-organizer of the first and second VQA workshops at CVPR.