Machine Learning will help speed the Big Data revolution
By: Nuala Moran, FUTURES contributor
21 June 2013

Over the past decade the science of machine learning has given rise to some notable advances in Microsoft products, from handwriting recognition, to sophisticated tools for predicting the probability of a particular advert being clicked on a web page, and the gesture-activated interface on the new Xbox One games console.

Working with specialists in the academic community and in interdisciplinary groups across Microsoft’s research labs, researchers have translated new machine learning methods from theory into robust and reliable applications that sit at the heart of consumer devices and enterprise systems.

Prof. Christopher Bishop, Distinguished Scientist, MSFT Research

Christopher Bishop, head of the Machine Learning and Perception group at Microsoft Research Cambridge, pinpoints handwriting recognition on the Tablet PC as the earliest example of this translation process. “This was the first commercial consumer system to have a level of accuracy in handwriting recognition to be of practical use,” he says.


If that was a breakthrough, there are now multiple examples of smart – but taken for granted – features that are based on machine learning. One case in point is the tool for manipulating pictures in Microsoft Office. Editing as he speaks, Bishop removes the background from a photograph of a dog. “You can cut the dog out and paste it into a Word document – it’s a near perfect job, that is based on machine learning. I’m doing something in real time to the image, but the underlying algorithm is tuned by looking at huge numbers of images.”

At a simple level, the analysis required to neatly remove the background without chopping off the dog’s ear involves crunching endless statistics about objects in general and the distribution of colours in objects versus the distribution of colours in the background. “It’s a seemingly impossible challenge that is now taken for granted,” Bishop notes. Another “great example” of a Microsoft product with machine learning at its heart is the Kinect motion and voice recognition sensor. Similarly, the advertising prediction algorithm on the Bing search engine – which is used to estimate the probability that an advert will be clicked – has its roots in machine learning.
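The colour-statistics idea described above can be illustrated with a toy sketch. This is not the algorithm used in Microsoft Office, which the article does not detail. It assumes colours have been reduced to a small hypothetical palette, that a handful of example pixels have been labelled as object or background, and that both classes are equally likely beforehand. All names and counts are made up for illustration.

```python
from collections import Counter

# Hypothetical quantised colours; a real system would work on far richer features.
PALETTE = ["brown", "white", "green", "blue"]

def colour_distribution(samples, palette=PALETTE, smoothing=1):
    """Estimate P(colour) from example pixels, with add-one smoothing."""
    counts = Counter(samples)
    total = len(samples) + smoothing * len(palette)
    return {c: (counts[c] + smoothing) / total for c in palette}

def prob_foreground(colour, fg_dist, bg_dist, prior_fg=0.5):
    """Bayes' rule: P(foreground | colour) from the two colour distributions."""
    fg = prior_fg * fg_dist[colour]
    bg = (1 - prior_fg) * bg_dist[colour]
    return fg / (fg + bg)

# Made-up training pixels: the object is mostly brown, the background mostly green.
object_pixels = ["brown"] * 6 + ["white"] * 2
background_pixels = ["green"] * 5 + ["blue"] * 2 + ["brown"] * 1

fg_dist = colour_distribution(object_pixels)
bg_dist = colour_distribution(background_pixels)

p_brown = prob_foreground("brown", fg_dist, bg_dist)
p_green = prob_foreground("green", fg_dist, bg_dist)
print(f"P(object | brown) = {p_brown:.3f}")
print(f"P(object | green) = {p_green:.3f}")
```

With these counts a brown pixel is judged more likely to belong to the object and a green pixel to the background. A practical tool would also have to consider neighbouring pixels and shape, which this sketch ignores.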


Bishop views machine learning as a basket for a number of technologies that have been developed and refined, and have fallen in and out of fashion, over the past thirty to forty years, including artificial intelligence, expert systems, neural networks and data mining. In essence, these technologies underpin computer systems that learn from statistical analyses of data sets. Now Bishop and his team are adding a powerful tool with Infer.NET, which is freely available for non-commercial use. This provides a framework for writing software that can adapt, learn and reason, taking machine learning to a higher level in which algorithms that scale to millions of data points will tame, manipulate and interrogate the rising tide of Big Data, generating original insights and knowledge.


A data-driven revolution

“We are at the start of a data-driven revolution. To exploit this data to the full and drive the revolution we need new techniques for machine learning that scale,” Bishop says. Infer.NET provides this capability by making it possible to combine detailed domain knowledge – as expressed in a graphical model – with the statistical method of Bayesian inference by which the probability of a hypothesis being right is automatically updated as additional evidence comes to hand.
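The updating step at the core of Bayesian inference can be sketched in a few lines. The example below is plain Python rather than Infer.NET, and the two hypotheses and their likelihood values are hypothetical numbers chosen only to show the mechanics: each new observation reweights the current belief by how well each hypothesis explains it.

```python
from fractions import Fraction

def bayes_update(prior, likelihoods):
    """Return P(hypothesis | evidence) given a prior and P(evidence | hypothesis)."""
    unnormalised = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

# Two hypothetical hypotheses, equally plausible before any evidence arrives.
belief = {"H1": Fraction(1, 2), "H2": Fraction(1, 2)}

# Each observation is summarised by how probable it is under each hypothesis
# (illustrative values, not taken from any real data set).
observations = [
    {"H1": Fraction(4, 5), "H2": Fraction(2, 5)},
    {"H1": Fraction(4, 5), "H2": Fraction(2, 5)},
]

history = [belief]
for likelihoods in observations:
    belief = bayes_update(belief, likelihoods)  # yesterday's posterior is today's prior
    history.append(belief)

for step, b in enumerate(history):
    print(step, {h: str(p) for h, p in b.items()})
```

Belief in H1 rises from 1/2 to 2/3 and then to 4/5 as the two observations arrive. A graphical model generalises this by linking many such variables, so that evidence about one can update beliefs about the others.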

There can be multiple possible interpretations of some measurements. For example, in handwriting recognition, ‘hello’ might also be interpreted as ‘halo’ or ‘hell’. In the past, these possibilities had to be accounted for manually. Using Infer.NET, the computer learns over time. “At the moment, computers are doing the number crunching but humans are providing the intelligence. Model-based learning will change that,” says Bishop. Infer.NET will make it possible to write software to carry out intelligent tasks “where what you are trying to work out can’t be measured directly,” he adds.

Big Data has arisen independently of machine learning and today it is possible to gain insights from the torrents of data coming out of social media, web search, environmental sensors, genomics, and so on, without using machine learning. An example would be in data visualisation such as mapping crimes, a useful source of information for the police, insurance companies and house hunters alike.


However, applying machine learning to growing data repositories will lead to a step change because it will be feasible to look for relationships across different kinds of data. Seen from this perspective, the confluence of machine learning and Big Data will engender a new method for carrying out scientific research and for conducting business analytics.

Understanding childhood asthma

This potential is highlighted in a research programme that Bishop and his colleagues are carrying out with scientists at Manchester University, in which model-based machine learning is being applied alongside traditional statistical methods to investigate the causes of childhood asthma.

The research centres on a cohort of 1,000 children who have been followed since birth, with the aim of understanding why some people get asthma and others do not. While there is thought to be an underlying genetic component, the incidence of asthma has increased markedly over the past 30 years, indicating a significant environmental component in the aetiology of the disease. However, to date these factors have been studied separately, and as a result attempts to unravel the specific contributions of genetics and environment have been inconclusive.

A huge amount of clinical, genetic, environmental and socio-economic data covering 2,000 variables has been gathered in the 12 or so years since the children were born. Model-based machine learning has made it possible to take an integrated view of all the domain knowledge of what factors might be important and weigh the impact of all these influences – and how they relate to each other – simultaneously.


Beyond medical statistics

The main output of the research to date has been the identification of five natural groups of children – ranging from the largest, the 50 per cent of the cohort who do not have asthma, to the smallest, a group of children who develop multiple allergies at an early age and also go on to develop asthma. This information could prove useful in diagnosis, treatment and the development of new asthma drugs, and may also lead to a deeper understanding of the role of genetic factors in childhood asthma. “Using model-based machine learning we are able to look under a different lamp post from traditional medical statistics, and maybe this will allow us to find the keys,” says Bishop.

While the immediate goal is to understand the causes of asthma, cracking this will be a potent exemplar for using model-based machine learning to analyse clinical information relating to other diseases, and to unlock meaning from other Big Data repositories.

The research will also demonstrate the capabilities and scalability of Infer.NET, paving the way for machine learning to once again bring a new level of sophistication to a range of Microsoft products and deliver the power of the Big Data revolution to benefit consumers and businesses.