Depth imaging is commonly based on light. For example, LIDAR and Kinect use infrared light, while stereo cameras use visible light. These systems require precise calibration and hardware operating at high sampling frequencies, and they dissipate significant power. In this paper, we investigate the potential of ultrasound for image and depth acquisition, with applications to human-computer interaction and skeletal tracking in mind. We use a loudspeaker array and a microphone array to sense the scene. We discuss a technique for performing loudspeaker beamforming offline, as is commonly done for microphone beamforming, which enables us to significantly increase the frame rate. Further, we propose a method for computing the depth image based on sound source localization, which gives a substantial improvement over the naïve time-of-flight approach. We designed inexpensive hardware with eight elements per array to obtain both the depth and the intensity images. Even with this limited number of transducers, we obtain promising experimental results.
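
For orientation, the naïve time-of-flight baseline mentioned above can be sketched as follows. This is only an illustrative sketch and not the hardware or processing used in this work: the sampling rate, the 40 kHz windowed pulse, the speed of sound, and the function name `tof_depth` are all assumptions chosen for the example. The round-trip delay is taken as the lag that maximizes the cross-correlation between the emitted pulse and the recorded echo, and the depth is half the distance travelled in that time.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed value)


def tof_depth(emitted, received, fs, c=SPEED_OF_SOUND):
    """Hypothetical helper: depth from the lag of the cross-correlation peak."""
    corr = np.correlate(received, emitted, mode="full")
    # Index 0 of the 'full' output corresponds to lag -(len(emitted) - 1).
    lag = np.argmax(corr) - (len(emitted) - 1)  # round-trip delay in samples
    return c * (lag / fs) / 2.0  # halve: the sound travels out and back


# Synthetic test signal (all parameters are assumptions for illustration).
fs = 192_000                                  # sampling rate in Hz
t = np.arange(0, 0.001, 1 / fs)               # 1 ms pulse duration
pulse = np.sin(2 * np.pi * 40_000 * t) * np.hanning(len(t))

true_depth = 1.0                              # metres
delay = int(round(2 * true_depth / SPEED_OF_SOUND * fs))
received = np.zeros(delay + len(pulse) + 100)
received[delay:delay + len(pulse)] = 0.3 * pulse  # attenuated echo, no noise

estimate = tof_depth(pulse, received, fs)
print(f"estimated depth: {estimate:.3f} m")
```

In this noise-free setting the estimate is limited only by the sample period; with a single reflector per direction and real noise, the peak can be far less distinct, which is the weakness of the naïve approach that the localization-based method is intended to address.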