This letter presents a novel approach to simultaneous depth and spectral imaging with a cross-modal stereo system. Two images of the target scene are captured simultaneously: a compressively sampled hyperspectral measurement and a panchromatic measurement. The underlying hyperspectral cube is first reconstructed using compressive sensing theory, with a self-adaptive dictionary learned from the panchromatic measurement to facilitate the reconstruction. The depth information of the scene is then recovered by estimating a disparity map between the hyperspectral cube and the panchromatic measurement through stereo matching. Once obtained, this disparity map is used to align the hyperspectral and panchromatic measurements, which in turn improves the hyperspectral reconstruction in an iterative manner. Through hardware prototype experiments, we demonstrate what is, to our knowledge, the first snapshot system that allows for simultaneous depth and spectral imaging. The proposed system has the potential to record depth and spectral videos of dynamic scenes simultaneously.
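The cross-modal stereo matching step can be illustrated with a minimal sketch. It is not the matching algorithm used in this letter, which the text above does not specify. It assumes rectified views, integer disparities, and that averaging the hyperspectral cube over its bands gives a usable stand-in for a panchromatic image. The function names (`cube_to_pan`, `block_match_disparity`, `disparity_to_depth`) and the focal length and baseline values are hypothetical and chosen only for illustration.

```python
import numpy as np


def cube_to_pan(cube):
    """Collapse an (H, W, bands) cube to a grayscale proxy (assumption: flat band average)."""
    return cube.mean(axis=2)


def box_sum(img, radius):
    """Sum over a (2r+1) x (2r+1) window with edge padding, via an integral image."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return (integral[k:, k:] - integral[:-k, k:]
            - integral[k:, :-k] + integral[:-k, :-k])


def block_match_disparity(ref, target, max_disp, radius=3):
    """Integer disparity d minimising the windowed SAD between ref[y, x] and target[y, x - d]."""
    h, w = ref.shape
    costs = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        diff = np.abs(ref[:, d:] - target[:, :w - d])
        costs[d, :, d:] = box_sum(diff, radius)
    return costs.argmin(axis=0)


def disparity_to_depth(disp, focal_px, baseline_m):
    """Rectified-stereo relation Z = f * B / d; zero disparity maps to infinity."""
    safe = np.maximum(disp, 1)
    return np.where(disp > 0, focal_px * baseline_m / safe, np.inf)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pan_view = rng.random((32, 48))                      # synthetic texture
    cube = pan_view[..., None] * np.linspace(0.5, 1.5, 8)  # toy 8-band cube
    shifted = np.roll(pan_view, 4, axis=1)               # second view, disparity 4
    disp = block_match_disparity(shifted, cube_to_pan(cube), max_disp=8)
    depth = disparity_to_depth(disp, focal_px=800.0, baseline_m=0.05)
    print(disp[16, 10], depth[16, 10])
```

In the iterative scheme described above, a disparity map of this kind would be used to warp one measurement onto the other before the next reconstruction pass. The sketch covers only the matching and depth conversion, not the reconstruction itself.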