Twitter is much more than cat pictures and what people had for lunch – it is a treasure trove of data about people’s life events, experiences, and opinions.
Recent research has started to look at how to use broad aggregate Twitter data to investigate and validate social issues such as employment, health and fiscal policy. A defining characteristic of this type of social policy research is the timescale and breadth of the data involved. While most tweet analysis concentrates on a short sliding time window on the order of hours or days, extracting meaningful social policy trends typically requires looking at many months or even years of data.
With ~500 million new tweets (~2-3 TB) being added to the Twitter data corpus daily, building systems that can efficiently handle that massive volume of data is a challenging task. In the dsoap project, we are working on solutions for this “huge data” problem by applying intelligent compaction, pre-indexing and distribution of data across a cluster of machines to achieve reasonable query times for online data exploration.
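To give a flavour of the pre-indexing and distribution idea, here is a minimal sketch (not the dsoap implementation – all names and the sharding scheme are illustrative assumptions): tweets are routed to shards by day, and each shard keeps a simple inverted index from term to tweet id, so a query over a given time range only needs to touch the shards holding those days.

```python
from collections import defaultdict
import hashlib

def shard_for(day: str, num_shards: int) -> int:
    """Deterministically map a YYYY-MM-DD day string to a shard id."""
    digest = hashlib.md5(day.encode()).hexdigest()
    return int(digest, 16) % num_shards

class Shard:
    """One node's slice of the corpus: an inverted index term -> tweet ids."""
    def __init__(self):
        self.index = defaultdict(set)

    def add(self, tweet_id: int, text: str):
        for term in text.lower().split():
            self.index[term].add(tweet_id)

    def query(self, term: str) -> set:
        return self.index.get(term.lower(), set())

NUM_SHARDS = 4
shards = [Shard() for _ in range(NUM_SHARDS)]

# Toy corpus: (tweet id, day, text) – illustrative data only.
tweets = [
    (1, "2013-05-01", "lost my job today"),
    (2, "2013-05-01", "new job starts monday"),
    (3, "2013-06-15", "flu season again"),
]

for tweet_id, day, text in tweets:
    shards[shard_for(day, NUM_SHARDS)].add(tweet_id, text)

# Query only the shards that could hold the days of interest,
# instead of scanning the whole corpus.
hits = set()
for day in ("2013-05-01",):
    hits |= shards[shard_for(day, NUM_SHARDS)].query("job")
print(sorted(hits))  # → [1, 2]
```

The same routing function is used at write and query time, so a trend query over a month fans out to a bounded set of machines rather than the full cluster.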