Abstract

It has been widely observed that search queries are composed
in a very different style from that of the body
or the title of a document. Many techniques explicitly
accounting for this language style discrepancy have shown
promising results for information retrieval, yet a large-scale
analysis of the extent of these language differences has been
lacking. In this paper, we present an extensive study of
this issue by examining the language model properties of
search queries and the three text streams associated with
each web document: the body, the title, and the anchor
text. Our information-theoretic analysis shows that queries
seem to be composed in a way most similar to how authors
summarize documents in anchor texts or titles, offering a
quantitative explanation for the observations in past work.
We apply these web-scale n-gram language models to
three search query processing (SQP) tasks: query spelling
correction, query bracketing, and long query segmentation.
By controlling the size and the order of the different language
models, we find the perplexity metric to be a good
accuracy indicator for these query processing tasks. We
show, for instance, that smoothed language models yield
significant accuracy gains for query bracketing compared
to the web counts used in the literature. We also demonstrate
that web-scale language models can offer a marked
accuracy advantage over smaller ones.
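
As a rough illustration of the perplexity metric mentioned above, the following minimal sketch scores a query under two tiny bigram language models, one built from title-like text and one from body-like text. Everything in it is an assumption made for illustration: the function names `train_bigram` and `perplexity`, the toy corpora, and the use of add-one smoothing, which stands in for whatever smoothing the paper's models actually use.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count bigram histories and bigrams over whitespace-tokenized lines."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for line in corpus:
        words = line.lower().split()
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        vocab.update(words)
    # Reserve one extra vocabulary slot for words never seen in training.
    return histories, bigrams, len(vocab) + 1

def perplexity(query, model):
    """Per-token perplexity of a query under an add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + query.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing: an illustrative choice only.
        p = (bigrams[(prev, word)] + 1) / (histories[prev] + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / (len(tokens) - 1))

# Toy text streams (hypothetical, not data from the paper).
titles = ["cheap flights to paris", "paris hotel deals"]
bodies = ["we offer a wide range of travel services to customers in paris and beyond"]

query = "cheap flights paris"
print("title model:", perplexity(query, train_bigram(titles)))
print("body model: ", perplexity(query, train_bigram(bodies)))
```

A lower perplexity means the model finds the query less surprising. On this toy data the title model assigns the query a lower perplexity than the body model, which mirrors the kind of comparison the analysis makes at web scale.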