Once you have downloaded the toolkit, you can step through this online tutorial to gain familiarity with the tools. We recommend completing the tutorial, which takes only a few minutes, before using the tools for other applications.
You can convert raw data into the WinMine XML data format by running the interactive conversion wizard DataConverter.exe and simply following its steps. Run DataConverter.exe now, and we’ll fill in the appropriate fields.
The first screen you see looks like the following:
Select “Raw text file” in the Data source type combo box, and then specify the name of the input and output file. We’ll first convert the transactions data file “transactions.raw”:
Note that we’re using “xdat” as the data file extension for the toolkit data files.
Press “Next” to proceed to the “Input file characteristics” page. The transactional data file has comma-separated fields, and the first line does not contain column names. Notice that the wizard has correctly guessed that the file is transactional. That is, the data is specified using three columns: the first column specifies the case (or row) id, the second column specifies a variable name, and the third column specifies the value of the variable.
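To make the transactional layout concrete, here is a minimal Python sketch that groups comma-separated (case id, variable name, value) rows into per-case records. The rows and ids below are made up for illustration; this is not how the wizard itself is implemented.

```python
# Each line of a transactional file has three fields: case_id,variable_name,value
raw_lines = [
    "1001,Encarta Encyclopedia 2000,1",
    "1001,NFL Fever 2000,2",
    "1002,Encarta Encyclopedia 2000,1",
]

cases = {}
for line in raw_lines:
    case_id, variable, value = line.split(",")
    # A variable that never appears for a case will later be assigned
    # the NULL state (e.g. "Missing").
    cases.setdefault(case_id, {})[variable] = int(value)
```

Note that case 1002 has no entry for “NFL Fever 2000”; that absence is exactly what the NULL state represents.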
The “Retain Case IDs” check box is checked by default. If this box is cleared, the data file will be smaller, particularly if the case id is long (such as a GUID). Because we’re going to join this data with the demographics data, however, we must retain these ids so that we can match product purchases to demographics.
For some data files, particularly transactions files, all of the variables have the same domain. This information can be useful for the wizard when it guesses the types of the variables. We can assume that the domain is the same for all products (it is simply the number of copies purchased) so check this box now:
Press “Next” to proceed to the “Initial Scan” page. When you press “Next” on this page, the data will be scanned and the wizard will guess the types and names of the variables. After the scan is done, the wizard automatically advances to the “Edit Variables” page, and you have the option to edit the domains of the variables by pressing the “Launch the VariableEditor application” button. Do so now, and click on the “Encarta Encyclopedia 2000” variable in the “Variable” column.
The wizard has guessed that the variable (and every other variable) has three states, and it has defined the NULL state to be “Missing”. This means that when an explicit value for “Encarta Encyclopedia 2000” does not occur for a customer, the variable is assigned the value “Missing”.
Because this is product-purchase data, we will change the name of the NULL state to reflect a non-purchase. Multiple-select all of the variables by clicking on “Encarta Encyclopedia 2000”, scrolling down to the bottom of the list, and then shift-clicking on “Visual J++ Professional Edition”. Change the name of the state “Missing” to “Didn’t purchase” by clicking on the name of the state, and then either clicking it once again, or by using the menu (Edit -> State Name). The dialog should now look like this:
(If the bottom half of the window does not have the list of states, you probably forgot to check “All Variables Have the Same Domain” on the “Input file characteristics” page; to get the same results as in the tutorial, you should re-import your data.)
Return to the wizard using the menu sequence “File->Save Changes and Return to wizard”. You can also abort your changes and try again.
Once the variable domains have been defined, the wizard needs to perform a final pass over the data. Press “Next” on the “Edit Variables” page to proceed to the “Data Conversion” page. When you press “Next” again, the data is scanned a final time, and the wizard advances to its final page. Press “Finish” to complete the wizard.
Run the wizard again, but this time specify “demos.raw” and “demos.xdat” for the raw and xml data files, respectively. Press “Next” to proceed to the “Input file characteristics” page, which should look like:
Note that the wizard has incorrectly guessed that “demos.raw” contains transactional data. Uncheck the “Transactional Data” box, and then check the “First Row Contains Variable Names” box. Again, be sure to keep the “Retain Case IDs” checked so that we can join this file with the transactions. The page should now look like this:
Press “Next” to proceed to the “Initial Scan” page, and then press “Next” on this page to scan the data. After the wizard advances to the “Edit Variables” page, press the “Launch the VariableEditor application” button to view the guessed domains of the variables:
Return to the wizard without making any changes. Press “Next” on the “Edit Variables” page to proceed to the “Data Conversion” page, press “Next” again to scan the data a final time, and press “Finish” on the final page to complete the wizard.
If you would like to get a quick summary of a data file, you may use DataCheck.exe. This tool is also useful for checking the syntax of an XML data file that was not generated with the WinMine tools.
Try running DataCheck.exe on the transactional data file:
datacheck -data transactions.xdat
The output should be:
Variables: 30 Cases: 1433 Flat dimension: 42990 Entries: 5845 Density: 0.135962
The flat dimension is simply the number of variables times the number of cases. ‘Entries’ refers to the number of values that the tools use to store the data in memory. The density is the number of entries divided by the flat dimension. You may also use DataCheck.exe to accumulate and report marginal counts for the variables.
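The arithmetic behind that report can be reproduced directly; the numbers below are the ones DataCheck.exe printed above.

```python
# Figures reported by DataCheck.exe for transactions.xdat
variables, cases, entries = 30, 1433, 5845

flat_dimension = variables * cases   # number of variables times number of cases
density = entries / flat_dimension   # fraction of the flat dimension actually stored
```

A low density like this one reflects that most customers purchase only a few products, so most (case, variable) slots carry the NULL state.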
You should now have two xml data files “transactions.xdat” and “demos.xdat”, containing transactions and demographics, respectively, for hypothetical customers of Microsoft products. We will now join these two data files into one that contains, for each customer, both products and demographics. We use the command-line executable DataJoin.exe to perform this join.
As with all of the command-line executables, you may (1) run the executable with no arguments to get a list of the required and optional arguments, (2) specify the arguments interactively by providing the single argument ‘-gui’, and (3) get help information by providing the single argument ‘-help’.
There are four different methods for joining the two files, corresponding to the four ways to define which customers appear in the output data: (1) only customers that are in “transactions.xdat”, (2) only customers that are in “demos.xdat”, (3) only customers that are in both, or (4) the union of all customers in both. Assuming that “transactions.xdat” is supplied as the first input file (‘-datain1’) and “demos.xdat” as the second input file (‘-datain2’), we obtain these joins by specifying ‘-left’, ‘-right’, ‘-inner’, or ‘-outer’, respectively, as a command-line flag to DataJoin.exe.
Let’s include in the join all customers from both files (an outer join). Recall that the NULL state for all variables in the transactions data was mapped to “Didn’t purchase”; this means that any customer in the demographics data who did not appear in the transactions data will have the value “Didn’t purchase” for all transaction variables. Similarly, those customers that are included in only the transactions data will have missing values for the demographic variables.
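To make the four join modes concrete, here is a minimal Python sketch of the case-selection logic on two dictionaries keyed by customer id. The ids and values are hypothetical, and the merge is simplified; DataJoin.exe is the real implementation.

```python
transactions = {"1001": {"NFL Fever 2000": 1}, "1002": {"Encarta Encyclopedia 2000": 1}}
demos = {"1002": {"age": 34}, "1003": {"age": 51}}

left  = set(transactions)               # -left : cases in the first file only
right = set(demos)                      # -right: cases in the second file only
inner = set(transactions) & set(demos)  # -inner: cases in both files
outer = set(transactions) | set(demos)  # -outer: the union of all cases

# In an outer join, a customer missing from one file gets that file's NULL
# states (e.g. "Didn't purchase" for the transaction variables).
combined = {cid: {**transactions.get(cid, {}), **demos.get(cid, {})}
            for cid in outer}
```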
We perform the join using:
DataJoin.exe -datain1 transactions.xdat -datain2 demos.xdat -dataout combined.xdat -outer
Note that we’re naming the output file “combined.xdat”. If we want to join the output file with yet another data file, we must supply DataJoin.exe with the ‘-keepids’ flag so that the output file contains the customer ids; by default, DataJoin.exe does not preserve these ids.
Now that we have a single data file with all of the customer data, we want to split it into a training set and a test set. To do this, we use the command-line executable DataSplit.exe. To perform a 70/30 train/test split, we use:
DataSplit.exe -input combined.xdat -train train.xdat -test test.xdat -ftrain 0.7
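Conceptually, the split shuffles the cases and assigns roughly the first 70% to the training file. The sketch below mimics that behavior in Python; the case ids and the sampling scheme are assumptions for illustration, and DataSplit.exe's actual procedure may differ.

```python
import random

case_ids = list(range(1433))   # one id per case in combined.xdat
rng = random.Random(0)         # fixed seed so the sketch is repeatable
rng.shuffle(case_ids)

cut = int(0.7 * len(case_ids))              # -ftrain 0.7
train_ids, test_ids = case_ids[:cut], case_ids[cut:]
```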
Now that we’ve prepared the training data, we can build a plan file that instructs the learning algorithm how to model each variable. In particular, there are three pieces of information that the learning algorithm can use for each variable:
- The role of each variable. Each variable can be an input variable (one that is used only to predict other variables), an output variable (one that is only predicted by other variables), an input-output variable (both predicted and used to predict), or ignored (not used). You may also specify the role as marginal-ignored or marginal-input, which is the same as ignored or input, respectively, except that a marginal distribution is constructed for the variable.
- The distribution used for each variable. The distribution specifies both (1) whether a table or a tree will be used to represent the distribution and (2) the “local” distribution used within the table or tree. In the current version of the toolkit, table distributions can only be used for discrete variables and the only allowed local distribution is the multinomial. For tree distributions, the local distributions allowed depend on whether the variable is discrete or continuous. For a discrete variable, the local distributions can be either multinomials or binary-multinomials (multinomials that also model missing). For a continuous variable, the local distributions can be Gaussians, binary-Gaussians, log-Gaussians, or binary-log-Gaussians. Whenever there are missing values in the data, the binary version of a local distribution should be used.
- Model-as-binary information (tree distributions only). For each variable, you may specify that the learning algorithm should concentrate on a binary version of that variable when constructing the tree distribution. For example, we may have a continuous variable, but we’re interested in modeling whether or not that variable is missing. The final distribution, however, is still defined over all of the range of the variable. For discrete variables, you may model missing vs. non-missing values, or one state vs. all other states. For a continuous variable, you may model missing vs. non-missing.
You may also use the plan file to specify partial-order constraints for a Bayesian network or to specify dependencies that you would like to forbid the learning algorithm from considering.
You may run the model-building tools without specifying a plan file. In this case, all of the variables are input-output variables, tree distributions are used for all variables, and both the local distribution and the model-as-binary information for each variable are automatically chosen using the values in the data. If the domain consists entirely of discrete variables, you may also build a Bayesian network with table distributions without a plan file.
The easiest way to construct a plan file is to run the interactive PlanEditor.exe. Run this executable, and click on the File menu:
If you have an existing plan file, you can open it now and edit the information. Otherwise, you can create a new plan file from a dataset.
Choose “Create From Data” in the File menu, and choose “combined.xdat” as the data file. The application loads the data and makes guesses for the distribution types and the model-as-binary information. The default roles of all variables are input-output:
The initial guesses used by PlanEditor.exe for a given dataset are the same as the guesses made by the modeling tools if no plan file is supplied. Note that we supplied the entire dataset to PlanEditor.exe, whereas we typically only use the training portion when using the modeling tools.
The local distribution types are guessed as follows (all distributions default to trees): for a continuous variable, either a Gaussian or a log-Gaussian is chosen, depending on which local distribution best fits the marginal data; if there are any missing values, the “binary” version of the distribution is used. For a categorical variable, either a multinomial or a binary-multinomial is used, depending on the absence or presence, respectively, of missing values.
The model-as-binary information is guessed as follows: if the variable is missing in at least 20% of the data records, then the guess is to model missing vs. non-missing. Otherwise, if it is a discrete variable and the most popular state occurs at least twice as often as the next most popular state, then the guess is to model the most popular state vs. the rest of the states. Otherwise, the guess is to not model the variable as binary.
To change the guess for a variable, simply select it in the list, and use the combo-boxes at the bottom to change the appropriate information. You can also select multiple variables and change them at the same time.
Assume that we’re most interested in predicting whether or not a person will buy each product, and that we’re not concerned with how many copies of that product he purchases; this is the model-as-binary guess that was made for all variables except “Money 2000 Business & Personal”. We can instruct the model-building algorithm to explicitly use the purchase/no-purchase distinction for “Money 2000 Business & Personal” by selecting that variable in the list and then, in the Model-as-binary combo boxes, changing the Type to “State vs. others” and the State to “Didn’t purchase”:
We can impose a partial order on the variables if we’re learning a Bayesian network. To do so, click on the “Partial Order” tab and specify the desired “Before/After” relationships. For this tutorial, we will not impose such an order.
We can prevent the learning algorithm from identifying any pair-wise dependency by specifying the appropriate edge in the “Forbidden Edges” tab. Let’s assume that we do not want to predict “Visual C++ 6.0 Professional Edition” using “Visual J++ 6.0 Professional Edition”. Click on the “Forbidden Edges” tab, select “Visual C++ 6.0 Professional Edition” from the “From” list, select “Visual J++ 6.0 Professional Edition” from the “To” list, and then press the “Add Forbidden Edge(s)” button.
Now save this plan file as ‘train.xplan’, and then exit the application. We’re ready to learn a model!
We will now construct a model for the domain using Dnet.exe. This executable learns either a dependency network or a Bayesian network. In the current version of WinMine, there are some significant restrictions when learning models with table distributions. In particular, only Bayesian networks can be learned, the role for every variable must be input-output, and the model-as-binary type for every variable must be none.
A dependency network is simply a set of conditional probability distributions for all of the variables that are specified (in the plan file or implicitly) as output variables or input-output variables. Trivial decision trees, consisting of a single root node, are also constructed for the marginal-ignored and marginal-input variables. A Bayesian network is similar, except that the set of distributions is constructed so that it defines, via simple multiplication, a joint distribution over the corresponding variables.
To learn a dependency network, supply the input training data, the input plan file, and the output model file as follows:
Dnet.exe -data train.xdat -plan train.xplan -model DepNet.xmod
To learn a Bayesian network, supply the additional ‘-acyclic’ flag:
Dnet.exe -data train.xdat -plan train.xplan -model BayesNet.xmod -acyclic
(You can control how complex the model is using the ‘-kappa’ and ‘-min’ flags; run ‘Dnet.exe -help’ for details).
To learn a Bayesian network with tables as distributions, we need first to restrict the domain to contain only discrete variables. For simplicity, we will use the transactions data file “transactions.xdat” without first splitting it into a training and testing set. You can create a plan file and specify that all conditional distributions are tables. Alternatively, you can supply Dnet.exe with the “-tables” option as follows:
Dnet.exe -data transactions.xdat -tables -model Tables.xmod -acyclic -kappa 1.0
Note that because we did not use a plan file, the forbidden-edge constraint specified above will not be imposed. We increased the value of kappa from its default value of 0.01 in order to learn a denser model. When Dnet.exe is used to learn a Bayesian network with table distributions, it performs a greedy DAG-based search algorithm starting with the model containing no edges; edges are greedily added, deleted, and reversed until a local maximum is reached.
To look at a model that you have built, run DnetBrowser.exe, and open the corresponding xmod file. First try viewing “DepNet.xmod”. You should see a moving graph similar to the following:
For “DepNet.xmod” and “BayesNet.xmod”, Dnet.exe has constructed a decision tree for each variable in the domain. The graph you see is a summary of the set of all trees. In particular, there is a directed edge from node X to node Y if, in the decision tree for X, there is a split on Y.
To toggle on and off the node-layout algorithm, select View->Layout from the menu, or press the button. You can click a node to highlight the direct relationships between that node and the other nodes in the graph. You can drag nodes to manually adjust the layout.
The slider along the left side of the window enables you to control the number of edges exposed based on the order in which those edges would be added by Dnet.exe if it were to use a best-first search.
To view the details of a decision tree, simply double-click the variable of interest. For example, if you double-click “age”, you will see the following decision tree:
To see the details of a leaf distribution, double-click any leaf node, and the details dialog will appear:
You can click on any node in the decision tree, and the details dialog will update accordingly. Each leaf distribution for “age” is modeled with a Gaussian distribution: the number following “+ or -” is the standard deviation. For a log-Gaussian, the geometric mean is shown instead of the mean, and the geometric standard deviation is shown following a “* or /”.
The text in the leaf node (e.g. “7.17 to 26”) is a summary of the distribution: the first number is the mean minus one standard deviation, and the second number is the mean plus one standard deviation.
Close the details dialog and return to the graph overview by pressing the button.
Now double-click on “NFL Fever 2000”. The leaves in this tree show a histogram of the states of the corresponding multinomial distribution. You can double-click a leaf node to get the details of the distribution. If you hover the cursor over a leaf node, the name of the most probable state appears as a tool-tip.
Return to the graph overview again, and double-click “occupation”. The tree should look like:
There are multinomial distributions in the leaves, but because there are so many states of the target, a histogram was not used. Instead, the most probable occupation is shown in each leaf. As before, you can see the details of the leaf distributions by double-clicking on any leaf node.
Now return to the graph overview.
You can use the Node Finder dialog to help find variables in the model. Press the button, and the following dialog will appear:
By default, the viewer only displays 60 variables on the screen at a time. Because there are fewer than 60 variables in our domain, all of them are shown. If you are viewing a model with more variables, however, the nodes that are not displayed by default can be identified using the Node Finder dialog if “Show hidden nodes” is checked. If “Add Linked Nodes” is checked, then all hidden parents and children of the selected node are shown when “Go to Node” is pressed.
Within both the graph overview and the decision-tree view, you can zoom in and out, scroll around, and change the font sizes.
Try opening “BayesNet.xmod” and viewing some of the trees. Notice that this graph, in contrast to the first, contains no directed cycles.
Now open “Tables.xmod” to look at the Bayesian network with table distributions.
When you double click on a node, a dialog will pop up that specifies a separate multinomial distribution for each distinct set of values for the parent nodes. For example, if you double-click the “Picture It! 2000” node, the following distribution is shown:
Note that the numbers in each row sum to one.
When browsing table-distribution Bayesian networks, we have the option to view the reversible and compelled edges. Intuitively, reversible edges are ones that can occur in the opposite direction in some other Bayesian network that is equivalent (in terms of representational ability) to the current one; compelled edges are edges that are not reversible. To see the reversible and compelled edges, select the ‘View’ menu and select ‘Compelled Edges’.
The compelled edges remain directed and are colored red; the reversible edges are now undirected and colored green. To go back to the previous view, select the ‘View’ menu and select ‘Compelled Edges’ again.
Now that we have constructed some models, we can use DnetLogscore.exe to evaluate how well these models predict out-of-sample data.
DnetLogscore.exe works as follows. For each case in the test data, DnetLogscore.exe evaluates the log posterior probability of the value for each output variable, given the values of all other variables. The average of these log posteriors across all variables in all cases is reported.
You specify the output variables using a plan file; any variable tagged with either the input-output or the output role type will be evaluated. In almost all cases, if you provided a plan file to Dnet.exe, that file should be used with DnetLogscore.exe. The exception is that if you want to evaluate variables that had one of the marginal role types, you must change those to output or input-output in the plan file supplied to DnetLogscore.exe. The role type for each variable is the only information DnetLogscore.exe uses from the plan file. If no plan file is supplied to DnetLogscore.exe, all variables are evaluated.
Let’s first evaluate the predictive accuracy of the dependency network:
dnetlogscore -model DepNet.xmod -data test.xdat -plan train.xplan -report DepNet.rpt
(If you omit the ‘-report’ argument, the output will be written to the screen).
We did not need to provide the plan file because train.xplan assigns the input-output role to all variables, which is the default behavior of DnetLogscore.exe. It is good to get in the habit of always providing the plan file, however, because the plan file is necessary when not all variables are outputs.
The contents of DepNet.rpt should be:
Scoring model “DepNet.xmod” using data from “test.xdat”
Log score = -0.449472
This means that on average, the log probability that each variable assigns to the given value in the test case, given the values of all other variables, is -0.449472. To compare models from different experiments, we can simply compare these log scores.
If you specify the ‘-mmcompare’ flag, DnetLogscore.exe constructs a marginal distribution for each target variable using the test data, and then reports the difference in log score between the provided model and this marginal model. Because the marginal model is constructed from the test data, even if the learned model for an output is itself a marginal model, the two distributions will typically differ. The resulting log score has the following interpretation: a positive value indicates the degree to which the model out-performs the marginal model on the test set, while a negative value indicates that the learning algorithm has probably over-fit the training data. Because the marginal distributions are constructed from the testing data, however, a more accurate way to measure the “lift over marginal” is to explicitly build a marginal model from the training data, evaluate both models with DnetLogscore.exe, and subtract the results.
To evaluate the lift over marginal with the ‘-mmcompare’ flag for the dependency network, use:
dnetlogscore -model DepNet.xmod -data test.xdat -plan train.xplan -report DepNetMM.rpt -mmcompare
The contents of DepNetMM.rpt should be:
Scoring model “DepNet.xmod” using data from “test.xdat”
Log score = 0.113855
We can perform the same evaluations for the Bayesian network with decision trees simply by including the ‘-mblanket’ flag. This flag instructs the executable to use “Markov-blanket inference” when computing the posterior of an output variable; that is, the probability is evaluated as a function of both the local distribution of the output variable and the local distributions of its children.
To evaluate the log score of the Bayesian network, use:
dnetlogscore -model BayesNet.xmod -data test.xdat -plan train.xplan -report BayesNet.rpt -mblanket
The contents of BayesNet.rpt should be:
Scoring model “BayesNet.xmod” using data from “test.xdat”
Log score = -0.426573
To evaluate the lift over marginal of the Bayesian network, use:
dnetlogscore -model BayesNet.xmod -data test.xdat -plan train.xplan -report BayesNetMM.rpt -mblanket -mmcompare
The contents of BayesNetMM.rpt should be:
Scoring model “BayesNet.xmod” using data from “test.xdat”
Log score = 0.136754
Finally, we can evaluate the log score of the table-based Bayesian network using the same data with which we trained the model:
dnetlogscore -model Tables.xmod -data transactions.xdat -plan train.xplan -report Tables.rpt -mblanket
and, to compare against the marginal model:
dnetlogscore -model Tables.xmod -data transactions.xdat -plan train.xplan -report TablesMM.rpt -mblanket -mmcompare
These should yield a log score and lift of -0.342336 and 0.0889559, respectively.
If you would like to evaluate how well a Bayesian network (jointly) predicts the test data, you can use DnetLogscore.exe with the ‘-mblanket’ flag omitted; this evaluates the average log probability of each output variable given its parents in the network, which is proportional to the log probability of the test data. Be careful never to compare this number to the log score of a dependency network; DnetLogscore.exe cannot be used to evaluate how well a dependency network jointly predicts the output variables.
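The factorization behind the joint evaluation can be sketched as follows: the per-variable terms are log p(x_v | parents(x_v)), so summing them gives the log of the joint probability of the case, and averaging them gives the per-variable score. The variables and probabilities below are made up for illustration.

```python
import math

# Hypothetical conditional probabilities p(x_v | parents(x_v)) for the
# variables of a single test case in a Bayesian network.
conditionals = {"age": 0.30, "occupation": 0.10, "NFL Fever 2000": 0.85}

log_joint = sum(math.log(p) for p in conditionals.values())
avg_log_score = log_joint / len(conditionals)   # the per-variable average

# The joint probability of the case is the product of the conditionals:
joint = math.prod(conditionals.values())
```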