If search and Twitter data are to be treated as a survey, they follow a very peculiar methodology: the respondent pool is a time-varying, demographically biased sample of the population; participants effectively answer a continuous stream of different "survey" questions; and participants choose how often they answer. In response, we propose alternative methods for fruitfully interpreting and using online and social media data.
There is a large body of research on using online activity to predict various real-world outcomes, ranging from outbreaks of influenza to outcomes of elections. There is considerably less work, however, on using these data to understand topic-specific interest and opinion among the general population and specific demographic subgroups, as currently measured by relatively expensive surveys. Here we investigate this possibility by studying a full census of all Twitter activity during the 2012 election cycle, along with the comprehensive search histories of a large panel of internet users over the same period, highlighting the challenges in interpreting online and social media activity as the results of a survey. As noted in existing work, the online population is a non-representative sample of the offline world (e.g., the U.S. voting population). We extend this work to show how demographic skew and user participation are non-stationary and unpredictable over time. In addition, the nature of user contributions varies wildly around important events. Finally, we note subtle problems in mapping what people share or consume online to specific sentiment or opinion measures on a particular topic. These issues must be addressed before meaningful insight about public interest and opinion can be reliably extracted from online and social media data.

Latest version (May 15, 2014)