Mobile phones have two sensors: a camera and a microphone. Our goal in this position paper is to explore the use of these sensors to build an audio-visual sensor network that exploits the installed base of millions of mobile phones worldwide. Among the several salient features of such a sensor network, we focus on mobility. Mobility is advantageous because it greatly improves spatial coverage; however, because device motion is uncontrolled, it is difficult to sample a required region with a given device. We propose a data-based abstraction to deal with this difficulty: rather than treating the physical devices as sensor nodes, we introduce a layer of static virtual sensor nodes corresponding to the locations of the sampled data. The virtual nodes covering a region of interest can be queried directly to obtain data samples for that region. We discuss how the locations of the virtual sensor nodes can be refined, and sometimes derived, using the visual data content itself. Experiments with real data expose some of the practical considerations for our design approach.
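The virtual-node abstraction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes virtual nodes are grid cells of fixed angular size, and the class name `VirtualSensorGrid`, the cell size, and the query interface are all hypothetical choices made for this example.

```python
from collections import defaultdict


class VirtualSensorGrid:
    """Static virtual sensor nodes layered over mobile data samples.

    Each grid cell acts as one virtual node that accumulates the samples
    recorded by (mobile) physical devices while they were inside that cell.
    Queries address the static cells, not the moving devices.
    """

    def __init__(self, cell_size=0.01):
        self.cell_size = cell_size      # cell width in degrees (assumed value)
        self.nodes = defaultdict(list)  # cell index -> list of samples

    def _cell(self, lat, lon):
        # Map a location to the index of the virtual node covering it.
        return (int(lat // self.cell_size), int(lon // self.cell_size))

    def ingest(self, lat, lon, sample):
        """A mobile device contributes a sample tagged with its location."""
        self.nodes[self._cell(lat, lon)].append(sample)

    def query_region(self, lat_min, lat_max, lon_min, lon_max):
        """Return samples from all virtual nodes overlapping the region.

        Resolution is cell-level: any node whose cell intersects the
        bounding box contributes all of its samples.
        """
        results = []
        i0, j0 = self._cell(lat_min, lon_min)
        i1, j1 = self._cell(lat_max, lon_max)
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                results.extend(self.nodes.get((i, j), []))
        return results
```

The key design point is that the query names a static region, so the uncontrolled motion of the physical devices is hidden: whichever devices happened to pass through the region populate its virtual nodes, and the caller never addresses a device directly.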