Size Estimation of Approximate Predicates

  • Hongrae Lee | University of British Columbia

Ever-increasing amounts of text are produced by end users or collected from multiple sources. Since such data typically has many errors and lacks a standardized representation, similarity query processing has recently drawn significant interest; it has a wide range of applications, including query refinement for web search and near-duplicate document detection and elimination. In this talk, we will discuss size estimation of similarity queries, which is crucial in query optimization. We consider string/substring matching problems with edit distance, and the set similarity join problem. The proposed techniques are based on recent developments in similarity query processing such as Min-Hashing. We present how we can combine such techniques with frequent pattern mining and sampling to estimate the size of similarity queries.
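
To make the size-estimation problem concrete, here is a minimal sketch of the simplest baseline: uniform random sampling for a string-matching predicate under edit distance. It is not the technique presented in the talk, which combines Min-Hashing with frequent pattern mining and sampling; the function names, parameters, and sample data are assumptions chosen for illustration only.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def estimate_match_count(strings, query, max_dist, sample_size, seed=0):
    """Estimate how many strings lie within max_dist edits of query.

    Hypothetical helper: draws a uniform sample without replacement,
    counts the matches in the sample, and scales the count up to the
    size of the full collection.
    """
    n = len(strings)
    if n == 0:
        return 0.0
    k = min(sample_size, n)
    rng = random.Random(seed)
    sample = rng.sample(strings, k)
    hits = sum(1 for s in sample if edit_distance(s, query) <= max_dist)
    return hits * n / k


# Illustrative data only (not from the talk).
names = ["smith", "smyth", "smithe", "jones", "brown"]
estimate = estimate_match_count(names, "smith", max_dist=1, sample_size=5)
```

When the sample covers the whole collection, the estimate equals the exact count. With small samples and highly selective predicates, few or no sampled strings match and the estimate becomes unreliable, which is one reason to look beyond plain sampling.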

Speaker Details

Hongrae Lee is a PhD candidate in the Department of Computer Science at the University of British Columbia, working with Dr. Raymond Ng. His research interests include data management, data mining and information retrieval. He interned at the DMX group at Microsoft Research in 2008 and 2009. He received his Master's degree in Computer Science from Seoul National University in 2004 and his Bachelor's degree from the same university in 1997. Prior to his graduate studies, he worked for more than six years in the software industry, where he participated in and led many projects for telecommunication companies and banks in Korea.
