Abstract

The Web and online search engines have greatly simplified information access. This has led to advantages in many areas, including education, disaster management, science, and community development. However, along with these advantages, several challenges have arisen, such as those related to data redundancy, query construction, the ethical use of the Web, and the design of appropriate evaluation methods. This dissertation addresses two general problems in information retrieval: similarity and satisfaction. Near duplication, in which documents in a collection are highly similar to one another, is common in document collections. This dissertation studies near duplicate detection in scholarly big data and shows that state-of-the-art methods from the Web are effective at detecting near duplicate scholarly documents. These findings inform the design of a scalable and efficient information extraction Web service for processing scholarly big data. The Web service includes a near duplicate matching backend to avoid redundant information extraction and is shown to reduce the time required to extract metadata and citations from 3.5 million academic documents by 8.46%. Similarity search is closely related to near duplicate detection; however, instead of identifying all near duplicates, the goal is to find documents that are similar to a given query document. This is especially useful when it is challenging to construct keyword queries for complex information needs. A similar document search engine is proposed that receives whole documents as queries and automatically finds similar files. The search engine is scalable and works with multiple similarity functions and document collections. It includes a recursive search algorithm that produces a search result tree, which is used for ranking and leads to a significant improvement in search performance.
There are many uses for similarity search on the Web. In this dissertation, a method for using similarity search to detect candidate sources of plagiarism from the Web is proposed. A single document is received as a query and potential sources of plagiarism are returned. The method achieves F-1 scores of 0.54 and 0.47 in offline and online evaluations, respectively. Similar methods are presented for detecting synthetic scientific articles and achieve precision and recall scores of 0.96 and 0.99, respectively. Finally, evaluation is an important topic underlying much of information retrieval. Methods for measuring good abandonment in mobile search are presented, where good abandonment refers to users being satisfied in search without the need to click on results. Using gestures as signals, an accuracy of 75% is achieved when differentiating between good and bad abandonment. Furthermore, it is shown how good abandonment is driven by mobile answers, snippets, and images on the results page.