Exploring the Biases of Big Data

Published April 2, 2013

Posted by Rob Knies

On Feb. 28, at the Santa Clara (Calif.) Convention Center, Kate Crawford, principal researcher at Microsoft Research New England, took the stage during the Strata Conference to deliver an illuminating, 17-minute talk entitled Algorithmic Illusions: Hidden Biases of Big Data.

During that presentation, she cautioned that data and collections of data are not objective. They are created and shaped by human beings, and understanding the unavoidable hidden biases people bring to data collection and analysis can be as significant as the data themselves.

Now, on the heels of that appearance, Crawford is bringing a similar message to a different audience, that of the Harvard Business Review, which has just published her contributed article underscoring the concepts she discussed during Strata 2013.

In this brief but compelling article, Crawford raises questions that need to be raised when examining a big data set: “… which people are excluded? Which places are less visible? What happens if you live in the shadow of big data sets?”

I had to know more, so I contacted her. The first thing I wondered about was how she began investigating the biases inherent in what is increasingly called “big data.” As it turns out, her investigation was prompted by one of the world’s biggest natural disasters in recent memory.

“I became fascinated with the insights and limits of big data when I started working with large sets of social-media data,” Crawford says, “most particularly while working on crisis communications projects back in 2010 and 2011, when Australia was experiencing the worst flooding on record. Collaborating with a team of social scientists, we were tracking tweets about the floods, seeking to understand communications patterns.

“People were using Twitter to share information and to squash rumors and to thank emergency services. The majority of tweets came from Queensland’s capital city of Brisbane, but the most substantial damage and loss of life was in smaller towns and rural areas.”

The preponderance of city dwellers’ tweets meant that their observations overshadowed the experiences of those most directly affected. The data set, in other words, was biased toward the people best placed to tweet.

“It couldn’t give us insight about the experiences in areas where people were cut off from telecommunications and power—or simply not using Twitter,” adds Crawford, also a visiting professor at the MIT Center for Civic Media.

“Then, in mid-2011, danah boyd and I co-authored a paper for the Oxford Internet Institute’s conference that articulated some of our concerns about big data and social media. There was very little around at the time that was asking critical questions of how big data was being used.”

The specific example of the Queensland floods would seem to point to the Twitter platform as being the culprit, but that’s not necessarily the case.

“Social-media data is one small slice of all the data that is out there,” Crawford says. “The same can be said for sensor data. But these are just examples of a bigger problem: Data sets from any source will have gaps and problems. There is no such thing as a data set that is untouched by human design: We decide what counts as data and what does not. Or, as Lisa Gitelman [media historian at New York University] has described, ‘Data need to be imagined as data to exist.’

“Big data is still subjective. It is still informed by disciplinary perspectives and the ever-changing histories of knowledge. Regardless of where the data come from, it’s useful to ask about the grounding assumptions, the methods, and the possible errors.”

Crawford offers a novel mechanism for enhancing the value of big data.

“Multidimensional data (data with depth, as I call it) can come from using mixed research methodologies: combining big-data analytics with small-data studies that bring out the depth, nuance, and context that big data often misses. Small data can also produce rich insights and different perspectives that are left out or are unreachable by big-data studies.

“But above all, social-science approaches help us to ask productive questions about data to prevent us from falling victim to our own cognitive biases that often suggest answers we expect or lead us to results we wish to find.”
