Language Differences and Metadata Features on Twitter

Web N-gram Workshop at SIGIR 2010

Published by ACM

In the past several years, microblogging services like Twitter and Facebook have become popular methods of communication, allowing users to share information with, and gather it from, hundreds or thousands (or even millions) of people, often in real time. As much of the content on microblogging services is publicly accessible, many secondary services have recently been built atop them, including services that perform significant content analysis, such as real-time search engines and trend analysis services. With the eventual goal of building more accurate and less expensive models of microblog streams, this paper investigates the degree to which language variance is related to the metadata of microblog content. We hypothesize that if a strong relationship exists between metadata features and language, then we will be able to use this metadata as a trivial classifier to match individual messages with specialized, more accurate language models. To investigate the validity of this hypothesis, we analyze a corpus of over 72M Twitter messages, building language models conditioned on a variety of available message metadata.
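The idea of using metadata as a trivial classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the unigram model, the add-one smoothing, the toy messages, and the `client` metadata field are all assumptions chosen for brevity. The sketch trains one model per metadata value and compares the perplexity of a held-out message under the general model and under the model selected by its metadata.

```python
import math
from collections import Counter, defaultdict


def train_models(messages, key):
    """Build one unigram count table per distinct value of a metadata field."""
    models = defaultdict(Counter)
    for msg in messages:
        models[msg[key]].update(msg["text"].lower().split())
    return models


def perplexity(counts, tokens, vocab_size):
    """Perplexity of `tokens` under an add-one smoothed unigram model."""
    total = sum(counts.values())
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


# Toy corpus; the "client" field is a hypothetical stand-in for message metadata.
messages = [
    {"client": "web", "text": "reading a good book tonight"},
    {"client": "web", "text": "reading the news tonight"},
    {"client": "mobile", "text": "omw to the gig lol"},
    {"client": "mobile", "text": "lol stuck in traffic omw"},
]

# General model: counts pooled over every message, regardless of metadata.
general = Counter()
for msg in messages:
    general.update(msg["text"].lower().split())
vocab_size = len(general)

# Specialized models: one count table per metadata value.
by_client = train_models(messages, "client")

# The metadata value itself acts as the classifier that picks the model.
held_out = {"client": "mobile", "text": "omw lol"}
tokens = held_out["text"].lower().split()

general_ppl = perplexity(general, tokens, vocab_size)
specialized_ppl = perplexity(by_client[held_out["client"]], tokens, vocab_size)

print(f"general model perplexity:     {general_ppl:.2f}")
print(f"specialized model perplexity: {specialized_ppl:.2f}")
```

On this toy data the specialized model assigns lower perplexity to the held-out message. Whether that holds on real microblog data is the empirical question the paper investigates.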