I am a Principal Researcher at Microsoft Research working on topics related to the processing of and inference from very large data sets, including (distributed) Database Systems, Cluster Management, Query Optimization, Statistical Techniques, Applied Machine Learning, Security Analytics and Information Retrieval.
My research has been published in the top conferences in Database Systems (SIGMOD, VLDB, ICDE), Artificial Intelligence (NIPS), Data Mining (SIGKDD), Information Retrieval (SIGIR, WWW), and Human-Computer Interaction (CHI); it has received multiple best paper awards and has been invited as a ‘Research highlight’ to the Communications of the ACM.
This work has impacted a number of different products at Microsoft and become part of Windows Server, Microsoft Exchange Server, Bing Search, Bing Ads, Azure Active Directory Identity Protection, Office 365, and Azure Application Insights.
I am currently working as a member of the AutoAdmin and Data Exploration research projects in the Data Management, Exploration and Mining Group at Microsoft Research. I am also a member of the cross-group BLEWS project on blogs and news.
My current research is focused on scalable algorithms for processing and indexing very large data sets in the context of web search and computational advertising.
Recent work includes:
Cluster Management: currently, we are investigating how to improve state-of-the-art cluster managers such as Kubernetes or Service Fabric for large-scale DBaaS clusters.
Security Analytics: I co-designed the risk-scoring approach in Azure Active Directory Identity Protection, enabling automated detection and response to suspicious activity within an organization.
Time-Adaptive Sketches for large Datasets: here, we studied the problem of how to make several well-known sketching techniques ‘time-aware’, i.e., giving better estimates for recent items, while aging out older ones.
Use of Artificial Intelligence in Database Systems: we have investigated the use of statistical techniques in various contexts to drive tuning and scaling behavior in DBMS; we have applied this approach in the context of SQL Query Progress Estimation, estimation of SQL Query Resource Consumption and resource scaling for Database-as-a-Service offerings.
Service Intelligence: we study the use of statistical techniques in the context of monitoring, tuning and problem-diagnostics for large-scale ‘Cloud’ database instances.
B-bit Minwise Hashing: we proposed a technique that improves upon the standard minwise hashing method (as well as sign random projections, Hamming-LSH, etc.) for set similarity estimation by storing only b bits of each hashed value (e.g., b = 1 or 2); using a novel estimator, we obtain order-of-magnitude improvements in the storage space required for a given level of accuracy in practice. Subsequently, we (a) extended the framework to three-way similarities, and (b) integrated (b-bit) minwise hashing with linear learning algorithms such as linear SVM and logistic regression to solve large-scale and high-dimensional statistical learning tasks. A 3-minute video introduction to the technique is available here.
Fast Set intersection: Set intersection is a central operator in IR and data mining; we propose techniques that give novel asymptotic bounds and outperform the state of the art in practice, while remaining robust: in the cases where our approach is not the best, it is close to the best-performing one.
Integrating Vertical Content with Web Search: Current search engines surface a plethora of content other than web pages, such as advertisements, news, images, movies, ‘answers’, etc. Retrieving the appropriate ‘vertical’ content for a given query is an important research challenge. Recently, we studied frameworks for the detection of query intent, which enable the selection of relevant content types, the integration of news results in web search and the dynamic construction of ‘portal’ pages for a given query.
Improving Retrieval Latency: The perceived latency of search is of critical importance to the overall search experience. We have studied algorithms and index structures aimed at minimizing the worst-case latency when retrieving ‘vertical’ content, surfacing advertisements in sponsored search or displaying structured data about entities (such as celebrities, products, locations) related to a search query.
BLEWS (= blogs + news): In the BLEWS system we studied how to surface blog entries commenting on news stories as part of the news browsing experience. The BLEWS system shows which types of blogs link back to a specific story and how much ‘attention’ the story is getting, and it allows the user to quickly navigate to the comments themselves. We also studied the distribution of navigational patterns used to access social media content (i.e., what type of content do users typically read in blogs and how do they get there?).
Text classification: here, our work has focused on the scalable and robust extraction/categorization of entities from very large corpora and reducing the human overhead in text classification settings.
My prior work in the context of the management of databases has focused on a monitoring infrastructure for database servers, the scalable exploration of different database designs and various techniques for result-size estimation.
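To make the b-bit minwise hashing idea from the list above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, a simple linear hash family stands in for random permutations, set elements are assumed to be integers, and the estimator is the simplified sparse-data form in which two unequal minima are assumed to agree on their lowest b bits with probability 2^-b. The full estimator has correction terms that this sketch omits.

```python
import random

# Modulus for the linear hash family (an assumption of this sketch).
MERSENNE_PRIME = (1 << 61) - 1


def make_hash_params(k, seed=0):
    """Draw k (a, c) pairs defining hash functions h(x) = (a*x + c) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(k)
    ]


def bbit_signature(items, params, b):
    """Keep only the lowest b bits of each of the k minimum hash values."""
    mask = (1 << b) - 1
    signature = []
    for a, c in params:
        min_hash = min((a * x + c) % MERSENNE_PRIME for x in items)
        signature.append(min_hash & mask)
    return signature


def estimate_resemblance(sig1, sig2, b):
    """Estimate Jaccard similarity from two b-bit signatures.

    Simplified form: subtract the assumed chance-collision rate 2^-b
    from the observed match rate and rescale to the range up to 1.
    """
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    p_hat = matches / len(sig1)
    chance = 1.0 / (1 << b)
    return (p_hat - chance) / (1.0 - chance)
```

With k hash functions, each set is summarized in k·b bits rather than k full hash values. Because each b-bit comparison carries less information than a full comparison, more hash functions are needed for the same accuracy; the estimate can also fall slightly below zero for nearly disjoint sets, since it is a rescaled match rate.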