I am currently working as a member of the AutoAdmin and Data Exploration research projects in the Data Management, Exploration and Mining Group at Microsoft Research. I am also a member of the cross-group BLEWS project on blogs and news.
Before joining Microsoft, I completed my Ph.D. in Computer Science at the University of the Saarland, in the Database Research Group of Prof. Gerhard Weikum.
My current research is focused on scalable algorithms for processing and indexing very large data sets in the context of web search and computational advertising. Recent work includes:
Service Intelligence: we study the use of statistical techniques for monitoring, tuning and problem diagnostics in large-scale ‘Cloud’ database instances.
B-bit Minwise Hashing: we proposed a technique that improves upon the standard minwise hashing method (as well as sign random projections, Hamming-LSH, etc.) for set similarity estimation by storing only the lowest b bits of each hashed value (e.g., b = 1 or 2); using a novel estimator, we obtain order-of-magnitude improvements in the storage space required for a given level of accuracy in practice. Subsequently, we (a) extended the framework to three-way similarities, and (b) integrated (b-bit) minwise hashing with linear learning algorithms such as linear SVM and logistic regression to solve large-scale and high-dimensional statistical learning tasks. A 3-minute video introduction to the technique is available here.
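A minimal Python sketch of the idea, intended as an illustration rather than the published implementation: each set is summarized by k minwise hash values, of which only the lowest b bits are kept, and the resemblance R is then recovered from the fraction of matching b-bit values. The function names, the use of keyed BLAKE2b as a stand-in for random permutations, and the simplified correction R ≈ (P_b - 2^-b) / (1 - 2^-b) are all assumptions of this sketch; that correction is only reasonable when both sets are small relative to the universe of possible elements.

```python
import hashlib


def hash64(seed, item):
    """64-bit hash of `item` under hash function number `seed`.

    Keyed BLAKE2b stands in for a random permutation (an assumption of
    this sketch, not part of the original technique description).
    """
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8, key=key)
    return int.from_bytes(digest.digest(), "little")


def b_bit_signature(items, k, b):
    """Keep only the lowest b bits of each of the k minwise hash values."""
    mask = (1 << b) - 1
    return [min(hash64(seed, x) for x in items) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b):
    """Estimate Jaccard resemblance from two b-bit signatures.

    Two unrelated minima agree on their low b bits with probability about
    2^-b, so the raw match rate P_b is corrected for chance collisions.
    Simplified form, assuming sparse sets.
    """
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    p_b = matches / len(sig1)
    chance = 1.0 / (1 << b)
    return (p_b - chance) / (1.0 - chance)


if __name__ == "__main__":
    a = set(range(0, 300))
    c = set(range(150, 450))   # true resemblance is 150 / 450 = 1/3
    k, b = 512, 2
    est = estimate_resemblance(b_bit_signature(a, k, b),
                               b_bit_signature(c, k, b), b)
    print(f"estimated resemblance with b={b}: {est:.3f}")
```

With b = 2, each signature entry occupies 2 bits instead of a full 64-bit hash value. The price is a higher variance per entry, which is offset by using more hash functions.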
Fast Set Intersection: set intersection is a central operator in IR and data mining. We propose techniques that give novel asymptotic bounds and outperform the state of the art in practice. They are also robust: in the cases where our approach is not the fastest, it remains close to the best-performing one.
Integrating Vertical Content with Web Search: Current search engines surface a plethora of content other than web pages, such as advertisements, news, images, movies, ‘answers’, etc. Retrieving the appropriate ‘vertical’ content for a given query is an important research challenge. Recently, we studied frameworks for the detection of query intent, which enable the selection of relevant content types, the integration of news results in web search and the dynamic construction of ‘portal’ pages for a given query.
Improving Retrieval Latency: The perceived latency of search is of critical importance to the overall search experience. We have studied algorithms and index structures aimed at minimizing the worst-case latency when retrieving ‘vertical’ content, surfacing advertisements in sponsored search or displaying structured data about entities (such as celebrities, products, locations) related to a search query.
BLEWS (= blogs + news): In the BLEWS system we studied how to surface blog entries commenting on news stories as part of the news browsing experience. The BLEWS system shows which types of blogs are linking back to a specific story and how much ‘attention’ the story is getting, and it allows the user to quickly navigate to the comments themselves. We also studied the distribution of navigational patterns used to access social media content (i.e., what type of content do users typically read in blogs and how do they get there?).
Text classification: here, our work has focused on the scalable and robust extraction and categorization of entities from very large corpora, and on reducing the human overhead in text classification settings.
My prior work on database management has focused on a monitoring infrastructure for database servers, the scalable exploration of different database designs and various techniques for result-size estimation.