Abstract

The need to find similar documents occurs in many settings, such as in plagiarism detection or research paper recommendation. Manually constructing queries to find similar documents may be overly complex, thus motivating the use of whole documents as queries. This paper introduces SimSeerX, a search engine for similar document retrieval that receives whole documents as queries and returns a ranked list of similar documents. Key to the design of SimSeerX is that is able to work with multiple similarity functions and document collections. We present the architecture and interface of SimSeerX, show its applicability with 3 different similarity functions and demonstrate its scalability on a collection of 3.5 million academic documents.