Datasets
This is a GPS trajectory dataset collected in (Microsoft Research Asia) GeoLife project by 182 users in a period of over two years (from April 2007 to August 2012). This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks, location privacy, and location recommendation. The following heat maps visualize its distribution in Beijing.


please cite the following two papers when using this dataset.
[1] Yu Zheng, Quannan Li, Yukun Chen, Xing Xie. Understanding Mobility Based on GPS Data. In Proceedings of ACM conference on Ubiquitous Computing (UbiComp 2008), Seoul, Korea. ACM Press: 312-321.
[2] Yu Zheng, Lizhu Zhang, Xing Xie, Wei-Ying Ma. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of International conference on World Wild Web (WWW 2009), Madrid Spain. ACM Press: 791-800.
This is a sample of T-Drive taxi trajectory dataset which was generated by over 10,000 taxis in a period of one week in Beijing.
Please cite the following two papers when using the dataset:
[1] Jing Yuan*, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, Yan Huang. T-Drive: Driving Directions Based on Taxi Trajectories. In Proceedings of ACM SIGSPATIAL Conference on Advances in Geographical Information Systems (ACM SIGSPATIAL GIS 2010),
[2] Jing Yuan*, Yu Zheng, Xing Xie, Guangzhong Sun. Driving with Knowledge from the Physical World. accepted by 17th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2011).
This is a portion of GPS trajectory dataset collected in (Microsoft Research Asia) GeoLife project. Each trajectory has a set of transportation mode labels, such as by driving, taking a bus, riding a bike and walking, which can support transportation mode learning.
Please cite the following three papers when using this GPS dataset.
[1] Yu Zheng, Like Liu, Longhao Wang, Xing Xie. Learning Transportation Mode from Raw GPS Data for Geographic Application on the Web, In Proceedings of International conference on World Wild Web (WWW 2008), Beijing, China. ACM Press: 247-256
[2] Yu Zheng, Quannan Li, Yukun Chen, Xing Xie. Understanding Mobility Based on GPS Data. In Proceedings of ACM conference on Ubiquitous Computing (UbiComp 2008), Seoul, Korea. ACM Press: 312-321.
[3] Yu Zheng, Yukun Chen, Quannan Li, Xing Xie, Wei-Ying Ma. Understanding transportation modes based on GPS data for Web applications. ACM Transaction on the Web. Volume 4, Issue 1, January, 2010. pp. 1-36.
This simulator can generate people’s requests for taxicabs on different road segments, using the knowledge mined from a large-scale real taxi trajectories. Each query consists of an origin, destination, and a timestamp. Please cite the following paper when using the simulator.
[1] Shuo Ma, Yu Zheng, Ouri Wolfson. T-Share: A Large-Scale Dynamic Taxi Ridesharing Service. In Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE 2013).
The dataset consists of the check-in data in New York City and Los Angels as well as the social structure of the users. Each check-in includes a venue ID, the category of the venue, a timestamp, and a user ID. Please cite the following papers when using the dataset.
[1] Jie Bao, Yu Zheng, Mohamed F. Mokbel. Location-based and Preference-Aware Recommendation Using Sparse Geo-Social Networking Data. ACM SIGSPATIAL GIS 2012.
[2] Ling-Yin Wei, Yu Zheng, Wen-Chih Peng, Constructing Popular Routes from Uncertain Trajectories. 18th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2012).
This dataset includes the concentration of three air pollutants, PM2.5, PM10, and NO2, from air quality monitoring stations in Beijing and Shanghai in the time span of 2013-2-8 to 2014-2-8. Please cite the following two papers when using the dataset.
[1] Yu Zheng, Furui Liu, Hsun-Ping Hsieh. U-Air: When Urban Air Quality Inference Meets Big Data. 19th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2013).
[2] Yu Zheng, Xuxu Chen, Qiwei Jin, Yubiao Chen, Xiangyun Qu, Xin Liu, Eric Chang, Wei-Ying Ma, Yong Rui, Weiwei Sun. A Cloud-Based Knowledge Discovery System for Monitoring Fine-Grained Air Quality. MSR-TR-2014-40.
The package is comprised of six parts of data that were extracted from the GPS trajectories of taxicabs, road networks, POIs of Beijing, and video clips recording real traffic on roads. Please cite the following two papers when using the dataset.
[1] Jingbo Shang*, Yu Zheng, Wenzhu Tong, Eric Chang. Inferring Gas Consumption and Pollution Emission of Vehicles throughout a City. In the Proceeding of the 20th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2014).
[2] Yu Zheng, Licia Capra, Ouri Wolfson, Hai Yang. Urban Computing: concepts, methodologies, and applications. ACM Transaction on Intelligent Systems and Technology (ACM TIST). 5(3), 2014.
This package is comprised of three parts of data. 1) tensors representing the 311 complaints on urban noise; 2) geographical feature of each region in NYC; 3) Real noise levels of 36 locations in NYC. Please cite the following two papers when using the dataset.
[1] Yu Zheng, Tong Liu, Yilun Wang, Yanchi Liu, Yanmin Zhu, Eric Chang. Diagnosing New York City’s Noises with Ubiquitous Data. In Proc of UbiComp 2014.
[2] Wang, Y., Zheng, Y., Liu, T. A noise map of New York City. In Proc. of UbiComp 2014.
The dataset was used for air quality forecast and real-time inference. It also can be used for test cross-domain data fusion methods. Please cite the following papers when using the dataset.
[1] Yu Zheng, Furui Liu, Hsun-Ping Hsieh. U-Air: When Urban Air Quality Inference Meets Big Data. 19th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2013).
[2] Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, Tianrui Li. Forecasting Fine-Grained Air Quality Based on Big Data. In the Proceeding of the 21th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2015).
The dataset contains bike usage (denoted by the number of check-outs and check-ins) at each bike sharing station in NYC and Chicago. The weather condition data during the period, in which the bike sharing data is collected, is also shared. Please cite the following papers when using the dataset.
[1] Yexin Lee, Yu Zheng, Huichu Zhang, Lei Chen. Traffic Prediction in a Bike Sharing System, In Proceedings of the 23rd ACM International Conference on Advances in Geographical Information Systems (ACM SIGSPATIAL 2015)
[2] Yu Zheng, Licia Capra, Ouri Wolfson, Hai Yang. Urban Computing: concepts, methodologies, and applications. ACM Transaction on Intelligent Systems and Technology (ACM TIST). 5(3), 2014.
This dataset is comprised of five parts of data, named Taxi Trip Data, Bike sharing data, 311 data, POIs and road network data of NYC. Please cite the following papers when using the dataset.
[1] Yu Zheng, Huichu Zhang, Yong Yu. Detecting Collective Anomalies from Multiple Spatio-Temporal Datasets across Different Domains. In Proceedings of the 23rd ACM International Conference on Advances in Geographical Information Systems (ACM SIGSPATIAL 2015). (Data) (Codes)
[2] Yu Zheng, Licia Capra, Ouri Wolfson, Hai Yang. Urban Computing: concepts, methodologies, and applications. ACM Transaction on Intelligent Systems and Technology (ACM TIST). 5(3), 2014.
This data set consists of two types of crowd flows. One is a five-year taxis flow in Beijing. The other is bike usage in a bike sharing system in New York City. A research on predicting flow of crowds have been conducted based on this dataset. Please cite the following paper when using the dataset.
[1] Junbo Zhang, Yu Zheng, Dekang Qi. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction, In Proceedings of the 31st AAAI Conference (AAAI 2017). (code)(data)(system)