Teaching a robot to see and navigate with simulation
Editor’s note: This post and its research are the collaborative efforts of our team, which includes Wenshan Wang, Delong Zhu of the Chinese University of Hong Kong; Xiangwei Wang of Tongji University; Yaoyu Hu, Yuheng Qiu, Chen Wang of the Robotics Institute of Carnegie Mellon University; Yafei Hu, Ashish Kapoor of Microsoft Research, and Sebastian Scherer.
The ability to see and navigate is a critical operational requirement for robots and autonomous systems. Consider, for example, autonomous rescue robots that must maneuver and navigate in challenging physical environments that humans cannot safely access. Likewise, building AI agents that can efficiently and safely close the perception-action loop requires thoughtful engineering of a robot's sensing and perceptual abilities. However, building a real-world autonomous system that operates safely at scale is a very difficult task. The partnership between Microsoft Research and Carnegie Mellon University, announced in April 2019, continues to advance the state of the art in autonomous systems through research focused on solving real-world challenges such as autonomous mapping, navigation, and inspection of underground urban and industrial environments.
Simultaneous Localization and Mapping (SLAM) is one of the most fundamental capabilities a robot needs. SLAM has made impressive progress with both geometric and learning-based methods; however, robust and reliable SLAM systems for real-world scenarios remain elusive. Real-life environments are full of difficult cases such as changing or low illumination, dynamic objects, and texture-less scenes.
Recent advances in deep reinforcement learning, data-driven control, and deep perception models are fundamentally changing how we build and engineer autonomous systems. Much of the past success with SLAM has come from geometric approaches. The availability of large training datasets, collected in a wide variety of conditions, helps push the envelope of data-driven techniques and algorithms.
SLAM is fundamentally different from, and more complicated than, static tasks such as image classification, object recognition, or activity recognition. First, it must sequentially recognize landmarks (such as buildings and trees) in a dynamic physical environment while driving or flying through it. Second, many SLAM systems combine multiple sensing modalities, such as RGB cameras, depth cameras, and LiDAR, which makes data collection a considerable challenge. Finally, we believe that a key to solving SLAM robustly is curating data instances with ground truth across a wide variety of conditions, with varying lighting, weather, and scenes, a task that is daunting and expensive to accomplish in the real world with real robots.
We present TartanAir, a comprehensive dataset for robot navigation tasks and more. The dataset was collected in photorealistic simulation environments based on AirSim, with various lighting and weather conditions and moving objects. Our paper on the dataset has been accepted and will appear at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020). By collecting data in simulation, we can obtain multi-modal sensor data and precise ground truth labels, including stereo RGB images, depth images, segmentation, optical flow, camera poses, and LiDAR point clouds. TartanAir contains a large number of environments with various styles and scenes, covering challenging viewpoints and diverse motion patterns that are difficult to achieve with physical data collection platforms. Based on the TartanAir dataset, we are hosting a visual SLAM challenge, which kicked off at a Computer Vision and Pattern Recognition (CVPR) 2020 workshop. It consists of a monocular track and a stereo track, each with 16 trajectories containing challenging features that aim to push the limits of visual SLAM algorithms. The goal is to localize the robot and map the environment from a sequence of monocular or stereo images. The deadline to submit entries to the challenge is August 15, 2020. To learn more about participating, visit the challenge site.
We minimize the sim-to-real gap by using a large number of environments with varied styles and diverse scenes. A distinguishing goal of our dataset is its focus on challenging conditions: changing light, adverse weather, and dynamic objects. State-of-the-art SLAM algorithms struggle to track the camera pose in our dataset, repeatedly getting lost on the more challenging sequences. We propose a metric to evaluate algorithm robustness. We also developed an automatic data collection pipeline for TartanAir, which allows us to process more environments with minimal human intervention.
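Our robustness metric is detailed in the paper; as a rough illustration of the kind of per-sequence evaluation involved, here is a minimal sketch (the function names are our own, not part of the TartanAir tools) of the standard absolute trajectory error (ATE) together with a simple success-rate style summary over many sequences:

```python
import numpy as np

def absolute_trajectory_error(gt, est):
    """RMSE of translational error between ground-truth and estimated
    trajectories, given as N x 3 position arrays that are already
    associated in time and aligned to a common frame."""
    diff = gt - est
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

def success_rate(ate_per_sequence, threshold=1.0):
    """Fraction of sequences tracked with ATE below a threshold (meters):
    one simple way to summarize robustness across many sequences, since a
    single averaged ATE hides sequences where tracking is lost entirely."""
    ates = np.asarray(ate_per_sequence, dtype=float)
    return float(np.mean(ates < threshold))
```

Evaluating over many diverse sequences, rather than a single trajectory, is what exposes the failure cases described above.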
We have adopted 30 photorealistic simulation environments that provide a wide range of scenarios covering many challenging situations. The simulated scenes include:
- Indoor and outdoor scenes with detailed 3D objects. The indoor environments are multi-room and richly decorated. The outdoor environments include various kinds of buildings, trees, terrains, and landscapes.
- Special purpose facilities and ordinary household scenes.
- Rural and urban scenes.
- Real-life and sci-fi scenes.
In some environments, we simulated multiple types of challenging visual effects:
- Hard lighting conditions: day-night alternating, low lighting, and rapidly changing illumination.
- Weather effects: clear, rain, snow, wind, and fog.
- Seasonal change.
Using AirSim, we can extract various types of ground truth labels, including depth, semantic segmentation tags, and camera poses. From the extracted raw data, we further compute other ground truth labels such as optical flow, stereo disparity, simulated multi-line LiDAR points, and simulated IMU readings.
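As one illustration of how such derived labels can be computed, the sketch below recovers the dense optical flow implied by a depth map and a known camera motion via reprojection. This is the generic technique, not the actual TartanAir pipeline code, and it assumes a pinhole intrinsics matrix `K` and a relative pose `(R, t)` mapping points from the first camera frame into the second:

```python
import numpy as np

def flow_from_depth_pose(depth, K, R, t):
    """Dense optical flow implied by a depth map and relative camera motion.

    depth: H x W depth along the optical axis (first frame).
    K:     3 x 3 pinhole intrinsics.
    R, t:  rotation and translation taking frame-1 points into frame 2.
    Returns an H x W x 2 array of (du, dv) pixel displacements.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel to a 3D point in the first camera frame.
    rays = pix @ np.linalg.inv(K).T
    pts1 = rays * depth[..., None]
    # Transform into the second camera frame and project back to pixels.
    pts2 = pts1 @ R.T + t
    proj = pts2 @ K.T
    uv2 = proj[..., :2] / proj[..., 2:3]
    return uv2 - pix[..., :2]
```

Note this gives the geometric flow of the static scene only; independently moving objects require their own motion to be composed in.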
In each simulated environment, we gather data by following multiple routes and moving with different levels of aggressiveness. The virtual camera can move slowly and smoothly without sudden jitter, or it can take intense, aggressive actions mixed with significant rolling and yaw motions.
We developed a highly automated pipeline to facilitate data acquisition. For each environment, we build an occupancy map by incremental mapping. Based on the map, we sample trajectories for the virtual camera to follow. A set of virtual cameras follows the trajectories to capture raw data from AirSim. The raw data are then processed to generate labels such as optical flow, stereo disparity, simulated LiDAR points, and simulated IMU readings. Finally, we verify the data by warping images across time and viewpoints to make sure they are valid and synchronized.
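A warping-based verification of this kind could look roughly like the following sketch, which assumes grayscale images and a dense flow label, and reports a photometric error after warping the second image back into the first frame. The function is illustrative, not the actual pipeline code; a real check would also handle occlusions and use sub-pixel interpolation:

```python
import numpy as np

def warp_error(img1, img2, flow):
    """Warp img2 back into frame 1 using flow (the frame-1 -> frame-2 pixel
    displacement field) with nearest-neighbor sampling, and return the mean
    absolute photometric error over pixels that land inside the image.
    A near-zero error suggests the image pair and its flow label agree."""
    H, W = img1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u2 = np.round(u + flow[..., 0]).astype(int)
    v2 = np.round(v + flow[..., 1]).astype(int)
    valid = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    warped = np.zeros_like(img1)
    warped[valid] = img2[v2[valid], u2[valid]]
    return np.abs(img1[valid] - warped[valid]).mean()
```

Thresholding such an error per frame pair is one simple way to flag desynchronized or corrupted samples automatically.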
Although our data is synthetic, we aim to push SLAM algorithms toward real-world applications. We hope that the proposed dataset and benchmarks will complement existing ones, help reduce overfitting to datasets with limited training or testing examples, and contribute to the development of algorithms that work well in practice. As our preliminary experiments on learning-based visual odometry (VO) show, a simple VO network trained on the diverse TartanAir dataset generalizes directly to real-world datasets such as KITTI and EuRoC and outperforms classical geometric methods on challenging trajectories. We hope TartanAir can push the limits of current visual SLAM algorithms in the same way.
Learn more about AirSim in the upcoming webinar, where Sai Vemprala will demonstrate how it can help train robust and generalizable algorithms for autonomy. Register by July 8, 2020.