Identifying and tracking people and mobile devices indoors has many applications but remains a challenging problem. We introduce a cross-modal sensor fusion approach to track mobile devices and the users carrying them. The CrossMotion technique matches the acceleration of a mobile device, as measured by an onboard inertial measurement unit, to similar acceleration observed in the infrared and depth images of a Microsoft Kinect v2 camera. This matching process is conceptually simple and avoids many of the difficulties typical of more common appearance-based approaches. In particular, CrossMotion does not require a model of the appearance of either the user or the device, nor in many cases a direct line of sight to the device. We demonstrate a real-time implementation that can be applied to many ubiquitous computing scenarios. In our experiments, CrossMotion found the person’s body 99% of the time, on average within 7 cm of a reference device position.
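To make the matching idea concrete, the following is a minimal sketch rather than the CrossMotion implementation itself. It assumes a set of candidate 3D point tracks has already been extracted from the camera images, that device and camera accelerations are time-synchronized and expressed in the same coordinate frame with gravity removed, and that acceleration can be estimated from positions by finite differences. The function names (`finite_difference_acceleration`, `mean_squared_difference`, `best_matching_track`) and the track labels are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def finite_difference_acceleration(positions: Sequence[Vec3], dt: float) -> List[Vec3]:
    """Estimate acceleration from 3D positions with central second differences.

    Returns len(positions) - 2 samples (the first and last frames are dropped).
    """
    accels: List[Vec3] = []
    for i in range(1, len(positions) - 1):
        prev, cur, nxt = positions[i - 1], positions[i], positions[i + 1]
        ax, ay, az = ((n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt))
        accels.append((ax, ay, az))
    return accels


def mean_squared_difference(a: Sequence[Vec3], b: Sequence[Vec3]) -> float:
    """Mean squared Euclidean distance between two acceleration sequences."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("need at least one acceleration sample in each sequence")
    total = 0.0
    for u, v in zip(a[:n], b[:n]):
        total += sum((x - y) ** 2 for x, y in zip(u, v))
    return total / n


def best_matching_track(
    device_accel: Sequence[Vec3],
    tracks: Dict[str, Sequence[Vec3]],
    dt: float,
) -> Tuple[str, float]:
    """Return the track whose image-derived acceleration best matches the device."""
    scores = {
        name: mean_squared_difference(
            device_accel, finite_difference_acceleration(positions, dt)
        )
        for name, positions in tracks.items()
    }
    best = min(scores, key=lambda name: scores[name])
    return best, scores[best]


if __name__ == "__main__":
    dt = 0.1  # assumed frame interval in seconds
    # Hypothetical tracks: one point accelerating along x at 2 m/s^2, one static.
    example_tracks = {
        "hand": [((i * dt) ** 2, 0.0, 0.0) for i in range(6)],
        "wall": [(1.0, 1.0, 2.0) for _ in range(6)],
    }
    imu_accel = [(2.0, 0.0, 0.0)] * 4  # device reports 2 m/s^2 along x
    print(best_matching_track(imu_accel, example_tracks, dt))
```

In this toy setup the accelerating track is selected because its estimated acceleration agrees with the device reading, while the static track does not. Note that the comparison uses only motion, so no appearance model of the user or device is involved.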