Using diverse big data to infer and predict fine-grained air quality throughout a city, and ultimately to tackle air pollution.
Many countries suffer from air pollution. Many cities have built a few air quality monitoring stations to inform people about urban air quality every hour. Influenced by multiple complex factors, however, urban air quality is highly skewed within a city: it varies significantly by location and changes over time differently in different places. Thus, we do not know the air quality of a location that has no monitoring station. Nor do we know what the air quality at a place will be tomorrow, let alone the root cause of the air pollution.
This project aims to infer the current fine-grained air quality throughout a city and to forecast the future air quality at each monitoring station. We also expect to identify the root causes of air pollution. For example, what proportion of the PM2.5 in the environment is derived from vehicular emissions? What is the spatio-temporal causal interaction between the air pollution of different cities?
The research has been made publicly available through a “cloud + client” framework, in which the cloud continuously collects real-time data, such as meteorological data and air quality data. A user can access the air quality information through a mobile client or a web client.
Urban Air
The first step of this project is to infer the real-time, fine-grained air quality of an arbitrary location by using two kinds of data. One is the real-time and historical air quality data from existing monitoring stations. The other is five additional data sources observed in a city: meteorological data, traffic, human mobility, POIs, and road network data. We propose a semi-supervised learning approach based on a co-training framework that consists of two separate classifiers. One is a spatial classifier based on an artificial neural network (ANN), which takes spatially-related features (e.g., the density of POIs and the length of highways) as input to model the spatial correlation between the air quality of different locations. The other is a temporal classifier based on a linear-chain conditional random field (CRF), which uses temporally-related features (e.g., traffic and meteorology) to model the temporal dependency of air quality at a location. Read the related publications for more details.
 Yu Zheng, Furui Liu, Hsun-Ping Hsieh. U-Air: When Urban Air Quality Inference Meets Big Data. In Proceedings of the 19th SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2013). (Data) (Website) (Mobile App) (Video)
 Yu Zheng, Xuxu Chen, Qiwei Jin, Yubiao Chen, Xiangyun Qu, Xin Liu, Eric Chang, Wei-Ying Ma, Yong Rui, Weiwei Sun. A Cloud-Based Knowledge Discovery System for Monitoring Fine-Grained Air Quality. MSR-TR-2014-40.
A dataset has been released for research purposes: download the data.
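The co-training idea behind the first step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published U-Air implementation: a toy nearest-centroid classifier stands in for both the ANN and the CRF, the feature values and labels are made up, and the names `CentroidClassifier`, `co_train`, and `infer`, as well as the rule for combining the two classifiers at inference time, are hypothetical.

```python
from math import dist


class CentroidClassifier:
    """Toy stand-in classifier: predicts the label of the nearest class centroid."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict_with_confidence(self, x):
        # Confidence is the relative margin between the two nearest centroids,
        # so this toy classifier needs at least two classes.
        ranked = sorted((dist(x, c), label) for label, c in self.centroids.items())
        (d1, label), (d2, _) = ranked[0], ranked[1]
        total = d1 + d2
        return label, ((d2 - d1) / total if total > 0 else 0.0)


def co_train(spatial, temporal, labels, unl_spatial, unl_temporal,
             rounds=5, per_round=2):
    """Co-train two classifiers on two feature views of the same locations."""
    Xs, Xt, y = list(spatial), list(temporal), list(labels)
    pool = list(range(len(unl_spatial)))  # indices of still-unlabeled samples
    sc, tc = CentroidClassifier(), CentroidClassifier()
    for _ in range(rounds):
        if not pool:
            break
        sc.fit(Xs, y)
        tc.fit(Xt, y)
        picks = {}  # sample index -> (label, confidence)
        for clf, view in ((sc, unl_spatial), (tc, unl_temporal)):
            scored = [(clf.predict_with_confidence(view[i]), i) for i in pool]
            scored.sort(key=lambda item: item[0][1], reverse=True)
            # Each classifier nominates its most confident samples; if both
            # nominate the same sample, the more confident label is kept.
            for (label, conf), i in scored[:per_round]:
                if i not in picks or conf > picks[i][1]:
                    picks[i] = (label, conf)
        for i, (label, _) in picks.items():
            Xs.append(unl_spatial[i])
            Xt.append(unl_temporal[i])
            y.append(label)
        pool = [i for i in pool if i not in picks]
    return sc.fit(Xs, y), tc.fit(Xt, y)


def infer(sc, tc, xs, xt):
    # Assumption: trust whichever classifier is more confident.
    candidates = (sc.predict_with_confidence(xs), tc.predict_with_confidence(xt))
    return max(candidates, key=lambda p: p[1])[0]


# Made-up toy data: spatial view = (POI density, highway length),
# temporal view = (traffic speed, humidity).
spatial = [(0.1, 0.5), (0.2, 0.4), (0.9, 3.0), (0.8, 2.5)]
temporal = [(60, 0.30), (55, 0.35), (20, 0.80), (25, 0.75)]
labels = ["good", "good", "unhealthy", "unhealthy"]
unl_spatial = [(0.15, 0.6), (0.85, 2.8), (0.5, 1.5)]
unl_temporal = [(58, 0.32), (22, 0.78), (30, 0.70)]

spatial_clf, temporal_clf = co_train(spatial, temporal, labels,
                                     unl_spatial, unl_temporal)
print(infer(spatial_clf, temporal_clf, (0.12, 0.5), (59, 0.30)))
```

The point of the two views is that a location whose spatial features are ambiguous may still be labeled confidently from its temporal features, and that label then becomes training data for the spatial classifier in the next round.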
The second step is to forecast the fine-grained air quality over the next 48 hours. Specifically, for the first 6 hours, we predict a real-valued AQI for each kind of air pollutant, at each hour, at each station. For the next 7-12, 12-24, and 24-48 hours, we predict a max-min range of the AQI over the corresponding time interval. Our predictive model comprises four major components: 1) a linear regression-based temporal predictor that models the local factors of air quality, 2) a neural network-based spatial predictor that models the global factors, 3) a dynamic aggregator that combines the predictions of the spatial and temporal predictors according to the meteorological data, and 4) an inflection predictor that captures sudden changes in air quality.
 Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, Tianrui Li. Forecasting Fine-Grained Air Quality Based on Big Data. In Proceedings of the 21st SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2015).
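Two parts of the forecasting step lend themselves to a small sketch: blending the temporal and spatial predictions according to the weather, and the shape of the 48-hour output. The code below is illustrative only. The wind-speed weighting rule, the function names, and the idea of deriving the ranges from 48 hourly values are all assumptions made here; the description above states only that the aggregator depends on meteorological data and that ranges are predicted for the later intervals.

```python
def spatial_weight(wind_speed_mps, full_weight_at=10.0):
    """Hypothetical rule: rely more on the spatial predictor as wind gets stronger."""
    return max(0.0, min(1.0, wind_speed_mps / full_weight_at))


def aggregate(temporal_pred, spatial_pred, wind_speed_mps):
    """Blend the two predictors with a weather-dependent weight (assumed form)."""
    w = spatial_weight(wind_speed_mps)
    return (1.0 - w) * temporal_pred + w * spatial_pred


def format_forecast(hourly_aqi):
    """Shape 48 hourly AQI values into the output format described above.

    Hours 1-6 are reported individually; later intervals are reported as
    (min, max) ranges. Each interval is assumed to start right after the
    previous one ends.
    """
    if len(hourly_aqi) != 48:
        raise ValueError("expected 48 hourly values")
    windows = {"7-12": (6, 12), "12-24": (12, 24), "24-48": (24, 48)}
    return {
        "hourly": list(hourly_aqi[:6]),
        "ranges": {
            name: (min(hourly_aqi[a:b]), max(hourly_aqi[a:b]))
            for name, (a, b) in windows.items()
        },
    }


# Example with made-up numbers: calm weather keeps the forecast close to
# the temporal (local) predictor, strong wind moves it toward the spatial one.
print(aggregate(temporal_pred=100.0, spatial_pred=200.0, wind_speed_mps=2.0))
print(format_forecast(list(range(1, 49)))["ranges"])
```

Reporting ranges rather than point values for the later intervals reflects the growing uncertainty of a forecast as the horizon extends.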
The Urban Air service covers 300 cities.
Given a limited budget to build a few additional air quality monitoring stations, where should we put them? This research addresses the problem from the perspective of maximizing the accuracy and stability of air quality inference.
 Hsun-Ping Hsieh*, Shou-De Lin, Yu Zheng. Inferring Air Quality for Station Location Recommendation Based on Big Data. In Proceedings of the 21st SIGKDD conference on Knowledge Discovery and Data Mining (KDD 2015).
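One generic way to think about the station placement question is as a greedy selection under a budget. The sketch below is not the method of the cited paper. It assumes each candidate site comes with a hypothetical inference-uncertainty score, and that a new station reduces the uncertainty of nearby sites with an exponential distance decay; the function name, the decay form, and the toy coordinates are all made up for illustration.

```python
from math import dist, exp


def pick_station_sites(candidates, uncertainty, budget, radius=1.0):
    """Greedily choose sites that cover the most remaining uncertainty.

    candidates:  {site name: (x, y)} coordinates of candidate sites
    uncertainty: {site name: score}, where higher means harder to infer
    """
    remaining = dict(uncertainty)
    chosen = []

    def coverage(a, b):
        # Assumed influence of a station at site a on site b (1.0 at a itself).
        return exp(-dist(candidates[a], candidates[b]) / radius)

    for _ in range(min(budget, len(candidates))):
        best = max(
            (site for site in candidates if site not in chosen),
            key=lambda site: sum(
                score * coverage(site, other) for other, score in remaining.items()
            ),
        )
        chosen.append(best)
        # A new station removes the uncertainty it covers, so a close
        # neighbor of an already chosen site becomes much less attractive.
        for other in remaining:
            remaining[other] *= 1.0 - coverage(best, other)
    return chosen


# Toy example: A and B are close together, C is far away.
sites = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (5.0, 5.0)}
scores = {"A": 1.0, "B": 0.9, "C": 0.8}
print(pick_station_sites(sites, scores, budget=2))
```

In the toy example, B has the second-highest score on its own, but once A is chosen most of B's uncertainty is already covered, so the second station goes to the distant site C.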
- Study the correlation between vehicular emissions and air quality.
- Identify the spatio-temporal causality between the air pollutants of different cities.
 Julie Yixuan Zhu, Chao Zhang, Huichu Zhang, Shi Zhi, Victor O.K. Li, Jiawei Han, and Yu Zheng. pg-Causality: Identifying Spatiotemporal Causal Pathways for Air Pollutants with Urban Big Data. IEEE Transactions on Big Data. DOI: 10.1109/TBDATA.2017.2723899 (Code and Data)
 Julie Yixuan Zhu, Chao Zhang, Yu Zheng, Shi Zhi, Victor O.K. Li, Jiawei Han. p-Causality: Identifying Spatio-temporal Causal Pathways for Air Pollutants with Urban Big Data. arXiv.
 “Analyzing newly available data about the intricacies of urban life could make cities better.” MIT Technology Review. 2013.8.21
 Interviewed by IFeng.com. Big data can predict air quality. 2013.11.29 (In Chinese)
 ComputerWorld: Microsoft predicts China’s air pollution with data analysis. 2015.6.11
 Ming Pao (HK): Microsoft predicts air quality with big data. 2015.6.10
 GeekWire Reporter: What Microsoft Research is doing to help Beijing air pollution. 2015.11.30
 NBC News: Microsoft, IBM Eye Big Business Opportunity in China’s Air Pollution. 2015.12.28
 Reuters: Tech giants spot opportunity in forecasting China’s smog. 2015.12.28
 China Daily: Microsoft, IBM eye Technology to forecast air pollution in China. 2016.1.19
 IEEE Spectrum: AI and Big Data vs. Air Pollution. 2016.12.19