Abstract

We consider strategies for reducing ambiguity in multi-modal data, particularly in the domain of images and text. Large data sets containing images with associated text (and vice versa) are readily available, and recent work has exploited such data to learn models for linking visual elements to semantics. This requires addressing a correspondence ambiguity, because it is generally not known which parts of the images connect with which language elements. In this paper we first discuss using language processing to reduce correspondence ambiguity in loosely labeled image data. We then consider the converse problem of using visual correlates to reduce ambiguity in text with associated images. Only rudimentary image understanding is needed for this task, because the image need only help differentiate among a limited set of choices, namely the senses of a particular word.