Microsoft’s new vision-language (VL) system significantly surpasses human performance
Vision-language (VL) systems allow searching the relevant images for a text query (or vice versa) and describing the content of an image using natural language. In general, a VL system uses an image encoding module…