The Knowledge Discovery and Data Mining (KDD) process consists of data selection, data cleaning, data transformation and reduction, mining, interpretation and evaluation, and finally incorporation of the mined “knowledge” into the larger decision-making process. The goals of this research project include the development of efficient computational approaches to data modeling (finding patterns), data cleaning, and data reduction for high-dimensional large databases. Methods from databases, statistics, algorithmic complexity, and optimization are used to build efficient, scalable systems that integrate seamlessly with the relational/OLAP database structure. This enables database developers to easily access and successfully apply data mining technology in their applications.
This is a long-term project. In the short term, the focus will be on automating the data mining process over data warehouses. This includes work in the following areas:
- Integration of data mining with database systems: Success of data mining as an enterprise technology crucially depends on seamless integration of this technology with enterprise databases. In this project, in collaboration with the SQL Server Product Group, we identify opportunities for new abstractions and interfaces that enable integration of data mining. Our joint work resulted in defining OLE DB DM, an extension of OLE DB that exposes data mining functionality. Our future work will focus on exploiting data mining for advanced data summarization and on enabling tighter coupling between database querying and data mining.
- Scalable Data Mining Algorithms: We are exploring scalable algorithms for modeling large databases. Methods considered include those for predictive modeling (predicting products a customer is likely to purchase based on other products in their basket) and segmentation/clustering (grouping together customers that are “similar” to each other). Specifically, we have focused on scalable decision tree algorithms for prediction, scalable probabilistic clustering algorithms, similarity detection algorithms between data objects, and mining sequence data. We are particularly interested in efficiently building data mining models in linear or near-linear time.
Panning for Gold
Mining has always been associated with dark, bottomless pits and workers who didn’t see the light of day for hours at a time. Integrating a data mining engine into Microsoft® SQL Server™ 2000 wasn’t quite that daunting, but the researchers involved did face some challenges.
Data mining is part of a larger process called Knowledge Discovery in Databases (KDD). The discovery part of the process, the part that finds gold among the gigabytes, is data mining. But before you can pull out your tin pan and shake it for gold, you need to gather your data into a data warehouse.
Most major organizations have data warehouses containing information about their clients, competitors, and products. These huge data warehouses hold gigabytes of “hidden” information that can’t easily be found using typical database queries, giving rise to the myth that the more data you have, the less you know. Data mining algorithms change all that by finding interesting patterns that you didn’t even know were there, like a prospector who discovers gold while trying to build a sawmill.
How will this help me?
Imagine you’re a store owner who has been using a database to determine which items sell the best and which sell the worst. Typically you’d stop stocking the lowest-selling items. However, data mining might determine that the customers who spent the most dollars at your store bought the lowest-selling items. To keep your most valuable customers happy, you’d need to keep stocking the rarely purchased marinated fish heads. A standard database query wouldn’t find this association, because you hadn’t asked it the right question.
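A toy sketch of the kind of cross-check a mining tool automates (the transactions, customer names, and items here are all made up for illustration): find the lowest-selling item, then measure how often the top-spending customers buy it.

```python
from collections import defaultdict

# Hypothetical transactions: (customer, item, dollars spent)
transactions = [
    ("ann", "wine", 40), ("ann", "fish heads", 12), ("ann", "bread", 3),
    ("bob", "wine", 40), ("bob", "milk", 2), ("bob", "bread", 3),
    ("cat", "wine", 40), ("cat", "fish heads", 12), ("cat", "milk", 2),
    ("dan", "milk", 2), ("dan", "bread", 3),
]

spend = defaultdict(int)        # total dollars per customer
bought = defaultdict(set)       # items each customer bought
item_count = defaultdict(int)   # how many customers bought each item
for cust, item, dollars in transactions:
    spend[cust] += dollars
    bought[cust].add(item)
    item_count[item] += 1

# The item that sells the least overall...
worst_seller = min(item_count, key=item_count.get)

# ...versus how often the top half of customers (by spend) buy it.
top = sorted(spend, key=spend.get, reverse=True)[: len(spend) // 2]
rate = sum(worst_seller in bought[c] for c in top) / len(top)
print(worst_seller, rate)   # fish heads 1.0
```

In this toy data, every one of the biggest spenders buys the worst-selling item, exactly the association a fixed query would never surface unless you thought to ask for it.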
Credit card companies have discovered another use for data mining. Before data mining, if you wanted to find fraudulent transactions using a database, you would query for all the transactions that had been determined fraudulent. You’d get back a few thousand records. Then you’d ask for all the transactions that were considered valid, and you’d get ten or a hundred million. Your next task would be to stare at hundreds of variables and try to decide which of them predicted fraud. Humans can handle two or three variables, but beyond that it’s over. Data mining algorithms structure the data and determine which attributes are relevant in a matter of minutes.
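A minimal sketch, with made-up transactions, of how an algorithm can rank attributes by how strongly their values separate fraud from valid transactions. Real miners use principled measures such as information gain; the simple fraud-rate contrast below just illustrates the idea.

```python
from collections import defaultdict

# Hypothetical labeled transactions: attribute values plus a fraud flag.
transactions = [
    {"country": "abroad", "hour": "night", "amount": "high", "fraud": True},
    {"country": "abroad", "hour": "night", "amount": "low",  "fraud": True},
    {"country": "home",   "hour": "day",   "amount": "high", "fraud": False},
    {"country": "home",   "hour": "day",   "amount": "low",  "fraud": False},
    {"country": "home",   "hour": "night", "amount": "low",  "fraud": False},
    {"country": "abroad", "hour": "day",   "amount": "high", "fraud": False},
]

def relevance(attr):
    """Spread between the highest and lowest fraud rate across the
    attribute's values: 0 means useless, 1 means perfectly separating."""
    frauds, totals = defaultdict(int), defaultdict(int)
    for t in transactions:
        totals[t[attr]] += 1
        frauds[t[attr]] += t["fraud"]
    rates = [frauds[v] / totals[v] for v in totals]
    return max(rates) - min(rates)

ranked = sorted(["country", "hour", "amount"], key=relevance, reverse=True)
print(ranked)
```

The point is that the machine scores every attribute at once: here `amount` scores zero (its fraud rate is the same for high and low values), which a human eyeballing millions of rows could easily fail to notice.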
SQL Server gets more power
Until now, you had two choices: ignore the data you couldn’t find or hire a statistician to apply algorithms to your data. That’s all changed, due to the marriage of research and product groups at Microsoft. The Data Management, Mining and Exploration (DMMX) group integrated a data mining engine into SQL Server 2000, which will make panning for gold as easy as writing a SQL Server query.
In the past, data mining tools used different data formats from those available in relational or OLAP (multidimensional) database systems. The data mining extensions in SQL Server 2000 will provide a common format for applications such as statistical analysis, pattern recognition, data prediction and segmentation methods, and visualization products.
The data mining engine in SQL Server 2000 is a powerful platform. Though it ships with two algorithms, it is extensible and supports data mining algorithms that you might build. For instance, if someone came along with a cool algorithm to predict which stocks would do well, they could plug it into SQL Server through the OLE DB for Data Mining API. In addition, the algorithms already integrated into SQL Server 2000 will make it easy for developers using the Visual Basic® development system and SQL Server 2000 to use data mining for analysis, prediction, and reporting purposes, without having to consult a specialist in data mining.
The two algorithms shipped with SQL Server 2000 are a scalable decision tree algorithm and a scalable clustering algorithm. A decision tree algorithm solves prediction problems. For instance, you might want to predict whether a high school student is going to go to college. If you have a database that contains information about people who did and did not go to college, the decision tree algorithm can use this data to learn rules for making predictions about new input. The rules can also report the probability that the prediction is correct.
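A toy illustration of what a learned decision tree looks like as a data structure (the attributes and probabilities here are invented, and this is not the shipped algorithm): internal nodes test an attribute, and each leaf carries the class and the probability observed in the training data.

```python
# A learned tree is just nested attribute tests; each leaf stores the
# predicted class and the probability observed in the training data.
tree = {
    "attr": "parent_encouragement",
    "branches": {
        "encouraged": {
            "attr": "grades",
            "branches": {
                "high": {"leaf": ("college", 0.90)},
                "low":  {"leaf": ("no college", 0.60)},
            },
        },
        "not encouraged": {"leaf": ("no college", 0.85)},
    },
}

def predict(node, record):
    """Walk the tree from the root; return (predicted class, probability)."""
    while "leaf" not in node:
        node = node["branches"][record[node["attr"]]]
    return node["leaf"]

student = {"parent_encouragement": "encouraged", "grades": "high"}
print(predict(tree, student))   # ('college', 0.9)
```

Because the model is just a chain of readable tests, you can trace exactly which rule produced any prediction, which is the interpretability discussed next.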
Another nice thing about decision trees is that they’re interpretable. Someone using the system will be able to tell what rules were used to arrive at the prediction. Other predictive modeling methods, such as neural networks, are a bit more like a crystal ball: you just feed in the data and the prediction magically appears.
Fast handling of large data sets
Even though decision tree algorithms have been around for a while, they haven’t worked well when the training data set is too large. Members of Microsoft Research and the SQL Server team at Microsoft came up with some clever techniques to pull the data out of SQL Server and quickly build decision trees from large sets of data.
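The article doesn’t spell out the techniques, but one standard family of ideas for trees over data too big for memory can be sketched as follows: stream over the rows and keep only per-attribute (value, class) counts, the sufficient statistics a split-selection test needs. The function and data below are hypothetical illustrations of that general idea, not the shipped implementation.

```python
from collections import defaultdict

def split_counts(rows, attributes, label="fraud"):
    """One streaming pass over the rows, accumulating
    (attribute, value, class) counts. These sufficient statistics
    are all a split-selection test needs, so the raw rows never
    have to be held in memory at once."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for row in rows:                 # rows can be a cursor over a table
        for a in attributes:
            counts[a][row[a]][row[label]] += 1
    return counts

rows = [
    {"hour": "night", "amount": "high", "fraud": True},
    {"hour": "night", "amount": "low",  "fraud": False},
    {"hour": "day",   "amount": "high", "fraud": False},
]
counts = split_counts(iter(rows), ["hour", "amount"])
print(counts["hour"]["night"][True])   # 1
```

The counts table is tiny compared with the raw data (its size depends on the number of attribute values, not the number of rows), which is what makes tree building on very large tables tractable.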
The clustering algorithms identify maximum similarities within a group, as well as maximum differences between groups. Customers may be grouped, or segmented, into those most likely to buy a certain product at certain times and under certain conditions. The resulting groups of clients are called clusters. Online stores that cluster their clients will recommend products to their customers based on past purchases. They may recommend you buy products that other people in your cluster have bought, guessing that your predilection for hot pink fuzzy slippers means you’ll also want the lime-green leg warmers, since the clustering algorithm showed that 9 out of 10 other hot pink fuzzy slipper buyers bought them.
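The slipper example can be sketched in a few lines. The cluster assignments and purchase histories below are invented; the recommendation rule is simply "the most popular item among your cluster-mates that you haven’t bought yet."

```python
# Hypothetical cluster assignments and purchase histories.
cluster_of = {"you": 1, "p2": 1, "p3": 1, "p4": 2}
bought = {
    "you": {"hot pink fuzzy slippers"},
    "p2":  {"hot pink fuzzy slippers", "lime-green leg warmers"},
    "p3":  {"hot pink fuzzy slippers", "lime-green leg warmers"},
    "p4":  {"lawn mower"},
}

def recommend(customer):
    """Most popular item among the customer's cluster-mates
    that the customer hasn't bought yet."""
    mates = [c for c in cluster_of
             if cluster_of[c] == cluster_of[customer] and c != customer]
    tally = {}
    for m in mates:
        for item in bought[m] - bought[customer]:
            tally[item] = tally.get(item, 0) + 1
    return max(tally, key=tally.get) if tally else None

print(recommend("you"))   # lime-green leg warmers
```

The hard part, of course, is producing good clusters in the first place; once they exist, the recommendation step is this simple.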
Statisticians have known about clustering algorithms for decades. However, most of the popular, easy-to-implement algorithms run quickly over small sets of data but break down when applied to large ones. The main problem is in the design: the algorithms run over and over until the groupings are found, and they may require many scans of the database at each iteration. If a business has a large database, or one that is spread out over different servers, pulling the data together for even one customer is non-trivial. It could take days to obtain information about your market clusters.
The scalable clustering algorithm in SQL Server 2000 clusters the database with one scan of the data. This helps address the computational difficulties of collecting data spread throughout an organization on different servers, since the data needs to be read only once. Right now there’s no standard way to cluster customer information that is spread across databases. If a database supports OLE DB for Data Mining, you only need to specify once how to pull everything together. After that, the data mining specification does it for you.
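The following sketch is not the shipped algorithm, but it illustrates the same idea in one dimension: make a single pass over the data, assigning each point to its nearest centroid and folding it into a compact per-cluster summary (a count and a running sum) rather than retaining the raw points.

```python
def one_scan_cluster(points, centroids):
    """Single pass over the data: assign each point to the nearest
    centroid and fold it into that cluster's running summary
    (count, sum). Each raw point is read once and then discarded."""
    summaries = [[0, 0.0] for _ in centroids]   # [count, sum] per cluster
    for x in points:                            # e.g. a cursor over a table
        i = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        summaries[i][0] += 1
        summaries[i][1] += x
        centroids[i] = summaries[i][1] / summaries[i][0]  # running mean
    return centroids, summaries

centroids, summaries = one_scan_cluster([1.0, 1.2, 0.8, 9.0, 9.4], [0.0, 10.0])
print(centroids)
```

Because the state carried between points is just a handful of numbers per cluster, the data can stream in from anywhere, including servers scattered across an organization, and still be read only once.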
The algorithms also solve problems with high-dimensional, sparse data. High-dimensional data contains millions of data points in thousands of dimensions. Sparse data means that each entity has only a few of the characteristics that are being measured. Clustering documents is one application of this algorithm.
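Documents are the classic case: the vocabulary may run to tens of thousands of words (dimensions), yet each document contains only a handful of them. A common trick, sketched here with made-up word counts, is to store only the nonzero entries and compare documents by the angle between them (cosine similarity).

```python
import math

# Sparse representation: store only the terms a document actually
# contains, instead of a vector over the entire (huge) vocabulary.
doc_a = {"gold": 3, "mine": 2, "pan": 1}
doc_b = {"gold": 1, "pan": 2, "river": 4}

def cosine(a, b):
    """Cosine similarity; only terms shared by both docs contribute."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = lambda d: math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm(a) * norm(b))

print(round(cosine(doc_a, doc_b), 3))   # 0.292
```

The work per comparison scales with the words the documents actually share, not with the full dimensionality, which is what makes sparse high-dimensional clustering feasible.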
What’s next for the data mining group? They’re very interested in analyzing sequences that occur in data. They’d like to answer questions about how, and in what order, people browse a web site. They’re investigating algorithms that take into account the times at which these events occur. This information will be useful to the thousands of dot-coms hoping to get your business by serving up the content you want when you need it, instead of making you slog through pages and pages of ever-increasing data. Without the power of data mining, searching for information is like panning for gold without a pan, and might yield fool’s gold instead of valuable facts.
Principal Software Architect