Social networking analysis site Klout is in the business of measuring online influence. Vice President of Engineering David Mariani explains how deriving insight from large volumes of data is key to the company’s success.
How does Klout benefit from collecting data on its users’ online activities?
Data is at the heart of our business; it’s what makes the Klout value proposition possible. We process billions of user data signals – posts coming from Facebook, Twitter, LinkedIn, or blogs, as well as retweets, “likes,” and mentions – from the social web every day, and with that data we help users leverage their digital voice and help brands connect with key influencers. Just like search engines came up with page ranks to help people sort through all the documents on the web and surface the most relevant content for search, we do the same thing for people. We help influencers in a number of different areas reach their audience.
How do you make sense of all this data you collect?
There are billions of signals, and a lot of it is noise, so we have to pair analytics and science to filter out the noise. In order to do so, we store and process large quantities of very granular data with Hadoop, which is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. We don’t want to aggregate the data too soon, since you never really know what you need to do with the data until trends and insights reveal themselves.
Hadoop is great for storing and processing lots of data cheaply, but it’s not great at doing any kind of interactive analysis or querying. So we use Microsoft SQL Server Analysis Services as a super index or cache that we put on top of Hadoop to be able to take advantage of all that raw data and query it at scale. We picked SQL Server Analysis Services (SSAS) because it’s a full-featured business intelligence engine that provides a true business view of our data in the form of a cube with measures and dimensions, delivering a rich semantic layer on top of raw, unstructured Hadoop data. It’s also inexpensive, has widespread query tool support and great documentation, and, most importantly, it scales. If you have to run a query and then go have a cup of coffee, it’s not as valuable as being able to do it at the speed of thought and get answers the second you hit the keyboard. Our average query time is under 10 seconds, and we’re talking about accessing almost half a trillion rows of data.
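To make the "cube with measures and dimensions" idea concrete, here is a minimal Python sketch. It is not Klout's pipeline and does not use Hadoop or SSAS; the signal records, the dimension names (network, signal type, day), and the helper functions `build_cube` and `query` are all hypothetical, chosen only to show how pre-aggregating granular records lets later questions be answered from a small summary instead of the raw data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical granular record: (network, signal_type, day, count).
Signal = Tuple[str, str, str, int]
CubeKey = Tuple[str, str, str]


def build_cube(signals: Iterable[Signal]) -> Dict[CubeKey, int]:
    """Pre-aggregate the measure (count) per combination of dimension values."""
    cube: Dict[CubeKey, int] = defaultdict(int)
    for network, signal_type, day, count in signals:
        cube[(network, signal_type, day)] += count
    return dict(cube)


def query(
    cube: Dict[CubeKey, int],
    network: Optional[str] = None,
    signal_type: Optional[str] = None,
    day: Optional[str] = None,
) -> int:
    """Sum the measure over all cells matching the filters (None means 'all')."""
    wanted = (network, signal_type, day)
    return sum(
        value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )


# Made-up sample data, purely for illustration.
signals = [
    ("twitter", "retweet", "2012-05-01", 3),
    ("twitter", "mention", "2012-05-01", 2),
    ("facebook", "like", "2012-05-01", 5),
    ("twitter", "retweet", "2012-05-02", 4),
]
cube = build_cube(signals)

print(query(cube, network="twitter"))      # all Twitter signals
print(query(cube, signal_type="retweet"))  # retweets across all days
```

The raw records stay untouched, in line with the point about not aggregating too soon; the cube is a derived summary that can be rebuilt whenever a new dimension turns out to matter.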
Why is being able to mine this data so important?
Klout generates information from data. If we were just collecting and storing signals, there would be no value. So we have to have a foundation for collecting and storing data, but it’s the science and intelligence that allow us to create information from data and give Klout users and partners the value they’re looking for.
Can you give an example of insight you gained from analyzing data that has helped your business?
Sure, one trend we uncovered using Microsoft’s BI tools was that the average Klout user retweets 15 times more than a non-Klout user on Twitter, and a Klout user tends to have 13 times as many Twitter followers as a non-Klout user. This means we can tell Twitter that a Klout user will be a lot more active on its network, so there’s more value created and more opportunity to monetize.
Also, we’re working on new consumer tools that use this business intelligence to identify exactly what kind of content has the most impact. So when you tweet something we can tell if someone else did something with it, or if it fell on deaf ears. It all comes down to using business intelligence to be able to identify what kind of content is valuable and important, and which is not.