Traditional text-based visual search has not improved enough over the years to meet the emerging demands of mobile users, even as searching on one’s phone while on the go becomes pervasive. This paper presents an application that facilitates the visual search experience of mobile phone users. By taking advantage of smartphone functionalities such as multi-modal and multi-touch interaction, users can formulate their search intent more conveniently, which can significantly improve search performance. The system, called JIGSAW (Joint search with ImaGe, Speech, And Words), represents one of the first attempts to create an interactive, multi-modal mobile visual search application. The key to JIGSAW is the composition of an exemplary image query, generated from raw speech via multi-touch user interaction, together with visual search based on that exemplary image. With JIGSAW, users formulate their search intent in a natural way, like playing a jigsaw puzzle on the phone screen: 1) the user speaks a natural sentence as the query; 2) the speech is recognized and converted to text, which is further decomposed into keywords through entity extraction; 3) the user selects preferred exemplary images that visually represent his/her intent and composes a query image via multi-touch; and 4) the composite image is used as a visual query to search for similar images. We have deployed JIGSAW on a real-world phone system, evaluated its performance on one million images, and demonstrated that it is an effective complement to existing mobile visual search applications.
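The four-step flow described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the JIGSAW implementation: speech recognition is assumed to have already produced text, entity extraction is replaced by a simple stopword filter, and the visual search is replaced by a position-aware keyword match over a hand-built index. All names (`extract_keywords`, `Placement`, `search`, the stopword list, and the index layout) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical stopword list; a crude stand-in for real entity extraction.
STOPWORDS = {"a", "an", "the", "and", "with", "in", "on", "of", "to",
             "find", "show", "me", "next", "under", "near"}


def extract_keywords(text: str) -> List[str]:
    """Step 2 (simplified): keep the non-stopword tokens of the recognized text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


@dataclass
class Placement:
    """Step 3 (simplified): one exemplar placed on the canvas by the user."""
    keyword: str
    x: float  # centre of the exemplar, normalised to [0, 1]
    y: float


# Assumed index layout: image id -> {keyword: (x, y) of that object in the image}.
Index = Dict[str, Dict[str, Tuple[float, float]]]


def score(query: List[Placement], objects: Dict[str, Tuple[float, float]]) -> float:
    """Reward images that contain each queried keyword near the placed position."""
    total = 0.0
    for p in query:
        if p.keyword in objects:
            ox, oy = objects[p.keyword]
            total += max(0.0, 1.0 - math.hypot(p.x - ox, p.y - oy))
    return total


def search(query: List[Placement], index: Index) -> List[str]:
    """Step 4 (simplified): rank indexed images by descending score."""
    return sorted(index, key=lambda image_id: score(query, index[image_id]),
                  reverse=True)


if __name__ == "__main__":
    keywords = extract_keywords("Find a tree next to the lake")
    # The user drags the "tree" exemplar to the left and the "lake" exemplar to the right.
    query = [Placement(keywords[0], 0.2, 0.5), Placement(keywords[1], 0.8, 0.5)]
    toy_index: Index = {
        "img1": {"tree": (0.2, 0.5), "lake": (0.8, 0.5)},
        "img2": {"tree": (0.8, 0.5)},
        "img3": {"car": (0.5, 0.5)},
    }
    print(search(query, toy_index))
```

The point of the sketch is the data flow: the composite query carries both what the user wants (keywords) and where each element should appear (positions), which a text-only query cannot express. A real system would match visual features of the selected exemplars rather than keyword labels.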