# Uncertainty 99

## Home

Uncertainty 99 was the Seventh International Workshop on Artificial Intelligence and Statistics and was presented by The Society for Artificial Intelligence & Statistics.

## Program Committee

• Russell Almond, Educational Testing Service
• Chris Bishop, Microsoft Research
• Wray Buntine, Ultimode Systems
• Peter Cheeseman, NASA Ames
• Max Chickering, Microsoft Research
• Paul Cohen, University of Massachusetts
• Greg Cooper, University of Pittsburgh
• Philip Dawid, University College London
• David Dowe, Monash University
• William DuMouchel, AT&T Labs
• Sue Dumais, Microsoft Research
• David Edwards, Novo Nordisk
• Doug Fisher, Vanderbilt University
• Nir Friedman, Hebrew University–Jerusalem
• Dan Geiger, Technion
• Edward George, University of Texas
• Clark Glymour, Carnegie-Mellon University
• Moises Goldszmidt, SRI International
• David Hand, Open University
• Geoff Hinton, University of Toronto
• Tommi Jaakkola, MIT
• Michael Jordan, UC Berkeley

• Michael Kearns, AT&T Labs
• Daphne Koller, Stanford University
• Steffen Lauritzen, Aalborg University
• Hans Lenz, Free University of Berlin
• David Lewis, AT&T Labs
• David Madigan, University of Washington
• Andrew Moore, Carnegie-Mellon University
• Daryl Pregibon, AT&T Labs
• Thomas Richardson, University of Washington
• Alberto Roverato, Universita di Modena
• Lawrence Saul, AT&T Labs
• Ross Shachter, Stanford University
• Richard Scheines, Carnegie-Mellon University
• Sebastian Seung, MIT
• Prakash Shenoy, University of Kansas
• David Spiegelhalter, MRC–Cambridge
• Peter Spirtes, Carnegie-Mellon University
• Milan Studeny, Academy of Sciences of Czech Republic
• Nanny Wermuth, Mainz University

## Monday, January 4

Time Session Speaker
7:30–8:45

Registration/Continental Breakfast

8:45–9:00

David Heckerman and Joe Whittaker
9:00–11:00

Session I: Model Choice
Chair: Thomas Richardson

Process-oriented evaluation: The next step Pedro Domingos
Model choice Alan Gelfand and Sujit Ghosh
A note on the comparison of polynomial selection methods Murlikrishna Viswanathan
Pattern discovery via entropy minimization Matthew Brand
11:00–11:30 Break
11:30–12:30 Session II: Latent variables
On the geometry of DAG models with hidden variables Dan Geiger, David Heckerman, Henry King, Chris Meek
Efficient structure search in the presence of latent variables Thomas Richardson, Heiko Bailer, Mooulinath Bannerjee
12:30–1:30 Lunch (provided)
1:30–5:00 Break
5:00–6:00 Poster Summaries (2 mins/poster)
Chair: Joe Whittaker
6:00–7:00 Dinner
7:00–9:30 Poster Sessions

## Tuesday, January 5

Time Session Speaker
8:00–9:00

Continental Breakfast

9:00–10:00

Session III: Theory
Chair:

Conditional products: an alternative approach to conditional independence

Phil Dawid, Milan Studeny
Hierarchical mixtures-of-experts for generalized linear models: some results on denseness and consistency Wenxin Jiang, Martin Tanner
10:00–10:30 Coffee Break
10:30–11:30 Session IV: Regression
Chair:
Boosting methodology for regression problems Greg Ridgeway, David Madigan, Thomas Richardson
Probabilistic kernel regression models Tommi Jaakkola, David Haussler
11:30–12:30 Session V: Computational Methods
Chair:
Learning structure from data efficiently: applying bounding techniques Nir Friedman, Lise Getoor
Efficient mining of statistical dependencies Tim Oates, Paul Cohen, Casey Durfee
12:30–2:00 Lunch (provided)
2:00–3:30 Session VI: Applications
Causal mechanisms and classification trees for predicting chemical carcinogens Louis A Cox
Geometric modelling of a nuclear environment Jan De Geeter
Modeling decision tree performance with the power law Lewis Frey, Doug Fisher

## Wednesday, January 6

Time Session Speaker
8:00–9:00

Continental Breakfast

9:00–10:30

Session VII: Inference
Chair:
Greg Cooper

Model-independent mean field theory as a local method for approximate propagation of information

Michael Haft, Reimar Hofmann, Volker Tresp
Hierarchical IFA belief networks Hagai Attias
Stochastic local search for Bayesian network Kalev Kask, Rina Dechter
10:30–11:00 Coffee Break
11:00–12:00 Session VIII: Applications
Chair:
Doug Fisher
An experiment in causal discovery using a pneumona database Peter Spirtes, Greg Cooper
Bayesian graphical models for non-compliance in randomaized trials David Madigan
12:00–12:15 Closing Remarks

Plenary Presentations lasted 25 minutes with 5 minutes for questions.

## Tutorials | January 3

#### Information Access and Retrieval

Speaker: Susan Dumais, Microsoft Research | 8:30 AM – 10:30 AM

The Web has made literally terabytes of information available at the click of a mouse. The challenge is in finding the right information. Information retrieval is concerned with providing access to textual data for which we have no good formal model, such as a relational model. Statistical approaches have been widely applied to this problem. This tutorial will provide an overview of: a) statistical characteristics of large text collections (e.g., size, sparcity, word distributions), b) important retrieval models (e.g., Boolean, vector space and probabilistic), and c) enhancements which use unsupervised learning to model structure in text collections, or supervised learning to incorporate user feedback. We will conclude with a discussion of open research issues where improved statistical models can improve performance.

#### Bayesian Statistical Analysis

Speaker: David Spiegelhalter, MRC Biostatistics Unit, Institute for Public Health, Cambridge | 11:00 AM – 12:00 AM and 1:30 PM – 2:30 PM

The first part of the tutorial will cover the fundamentals of Bayesian inference, including probability and its subjective interpretation, evaluation of probability assessments using scoring rules, utilities and decision theory. The use of Bayes theorem for updating beliefs will be illustrated for both binomial and normal likelihoods, and the use of conjugate families of priors and predictive distributions described. The First Bayes software will be used to display conjugate Bayesian analysis. The second part will introduce the concept of `exchangeability’, and the consequent use of hierarchical models in which the unknown parameters of a common prior are included in the model. Conditional independence assumptions lead naturally to a graphical representation of hierarchical models. Markov chain Monte Carlo (MCMC) methods will be introduced as a means of carrying out the necessary numerical integrations, and topics covered will include the relationship of Gibbs sampling to graphical modelling, parameterisation, initial values, and choice of prior distributions. Real examples will be used throughout, and on-line analysis of an example in longitudinal modelling with measurement error on predictors will be carried out using the WinBUGS program.

#### Additive Logistic Regression: A Statistical View of Boosting

Speaker: Trevor Hastie, Stanford University | 3:30 PM – 5:00 PM

Boosting (Freund and Schapire, 1995) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Directmulti-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large scale data mining applications.

* joint work with Jerome Friedman and Rob Tibshirani

## Abstracts | January 4 – 6

#### Hierarchical IFA Belief Networks

Author: Hagai Attias, Gatsby Computational Neuroscience Unit, University College London

Abstract: We introduce a new real-valued belief network, which is a multilayer generalization of independent factor analysis (IFA). At each level, this network extracts real-valued latent variables that are non-linear functions of the input data with a highly adaptive functional form, resulting in a hierarchical distributed representation of these data. The network is based on a probabilistic generative model, constructed by cascading single-layer IFA models. Whereas exact maximum-likelihood learning for this model is intractable, we present and demonstrate an algorithm that maximizes a lower bound on the likelihood. This algorithm is developed by formulating a variational approach to hierarchical IFA networks.

Availability: PDF

#### Pattern discovery via entropy minimization

Abstract: We propose a framework for learning hidden-variable models by optimizing entropies, in which entropy minimization, posterior maximization, and free energy minimization are all equivalent. Solutions for the maximum a posteriori (MAP) estimator yield powerful learning algorithms that combine all the charms of expectation-maximization and deterministic annealing. Contained as special cases are the methods of maximum entropy, maximum likelihood, and a new method, maximum structure. We focus on the maximum structure case, in which entropy minimization maximizes the amount of evidence supporting each parameter while minimizing uncertainty in the sufficient statistics and cross-entropy between the model and the data. In iterative estimation, the MAP estimator gradually extinguishes excess parameters, sculpting a model structure that reflects hidden structures in the data. These models are highly resistant to over-fitting and have the particular virtue of being easy to interpret, often yielding insights into the hidden causes that generate the data.

Availability: PDF

#### Causal Mechanisms and Classification Trees for Predicting Chemical Carcinogens

Author: Louis Anthony (“Tony”) Cox, Jr., Cox Associates

Abstract: Classification trees, usually used as a nonlinear, nonparametric classification method, can also provide a powerful framework for comparing, assessing, and combining information from different expert systems, by treating their predictions as the independent variables in a classification tree analysis. This paper discusses the applied problem of classifying chemicals as human carcinogens. It shows how classification trees can be used to compare the information provided by ten different carcinogen classification expert systems, construct an improved “hybrid” classification system from them, and identify cost-effective combinations of assays (the inputs to the expert systems) to use in classifying chemicals in future.

Availability: PDF

#### Conditional Products: An Alternative Approach to Conditional Independence

Authors: A. Philip Dawid, Department of Statistical Science, University College London; Milan StudenyInstitute of Information Theory and Automation, Academy of Sciences of Czech Republic, and Laboratory of Intelligent Systems, University of Economics Prague

Abstract: We introduce a new abstract approach to the study of conditional independence, founded on a concept analogous to the factorization properties of probabilistic independence, rather than the separation properties of a graph. The basic ingredient is the “conditional product”, which provides a way of combining the basic objects under consideration while preserving as much independence as possible. We introduce an appropriate axiom system for conditional product, and show how, when these axioms are obeyed, they induce a derived concept of conditional independence which obeys the usual semi-graphoid axioms. The general structure is used to throw light on three specific areas: the familiar probabilistic framework (both the discrete and the general case); a set-theoretic framework related to “variation independence”; and a variety of graphical frameworks.

Availability: PDF

#### Geometric Modeling of a Nuclear Environment

Authors: Jan De Geeter and Marc Decréton, SCK.CEN (Belgian Nuclear Research Centre); Joris De Schutter, Herman Bruyninckx, and Hendrik Van Brussel, Department of Mechanical Engineering, Division PMA, Katholieke Universiteit Leuven

Abstract: This paper is about the task-directed updating of an incomplete and inaccurate geometric model of a nuclear environment, using only robust radiation-resistant sensors installed on a robot that is remotely controlled by a human operator. In this problem, there are many sources of uncertainty and ambiguity. This paper proposes a probabilistic solution under Gaussian assumptions. Uncertainty is reduced with an estimator based on a Kalman filter. Ambiguity on the measurement-feature association is resolved by running a bank of those estimators in parallel, one for each plausible association. The residual errors of these estimators are used for hypothesis testing and for the calculation of a probability distribution over the remaining hypotheses. The best next sensing action is calculated as a Bayes decision with respect to a loss function that takes into account both the uncertainty on the current estimate, and the variance/precision required by the task.

#### Process-Oriented Evaluation: The Next Step

Author: Pedro Domingos, Artificial Intelligence Group, Instituto Superior Técnico

Abstract: Methods to avoid overfitting fall into two broad categories: data-oriented (using separate data for validation) and representation-oriented (penalizing complexity in the model). Both have limitations that are hard to overcome. We argue that fully adequate model evaluation is only possible if the search process by which models are obtained is also taken into account. To this end, we recently proposed a method for process-oriented evaluation (POE), and successfully applied it to rule induction (Domingos, 1998). However, for the sake of simplicity this treatment made two rather artificial assumptions. In this paper the assumptions are removed, and a simple formula for model evaluation is obtained. Empirical trials show the new, better-founded form of POE to be as accurate as the previous one, while further reducing theory sizes.

Availability: PDF

#### Modeling Decision Tree Performance with the Power Law

Authors: Lewis J. Frey and Douglas H. Fisher, Jr., Computer Science Department, Vanderbilt University

Abstract: This paper discusses the use of a power law to predict decision tree performance. Power laws are fit to learning curves of decision trees trained on data sets from the UCI repository. The learning curves are generated by training C4.5 on different size training sets. The power law predicts diminishing returns in terms of error rate as training set size increase. By characterizing the learning curve with a power law, the error rate for a given size training set can be projected. This projection can be used in estimating the amount of data needed to achieve an acceptable error rate, and the cost effectiveness of further data collection.

Availability: PDF

#### Efficient Learning using Constrained Sufficient Statistics

Authors: Nir FriedmanInstitute of Computer Science, The Hebrew University; Lise GetoorComputer Science Department, Stanford University

Abstract: Learning Bayesian networks is a central problem for pattern recognition, density estimation and classification. In this paper, we propose a new method for speeding up the computational process of learning Bayesian network structure. This approach uses constraints imposed by the statistics already collected from the data to guide the learning algorithm. This allows us to reduce the number of statistics collected during learning and thus speed up the learning time. We show that our method is capable of learning structure from data more efficiently than traditional approaches. Our technique is of particular importance when the size of the datasets is large or when learning from incomplete data. The basic technique that we introduce is general and can be used to improve learning performance in many settings where sufficient statistics must be computed. In addition, our technique may be useful for alternate search strategies such as branch and bound algorithms.

Availability: PDF

#### On the geometry of DAG models with hidden variables

Authors: Dan Geiger, David Heckerman, and Christopher Meek, Decision Theory & Adaptive Systems, Microsoft Research; Henry KingMathematics Department, University of Maryland

Abstract: We prove that many graphical models with hidden variables are not curved exponential families. This result, together with the fact that some graphical models are curved and not linear, implies that the hierarchy of graphical models, as linear, curved, and stratified, is non-collapsing; each level in the hierarchy is strictly contained in the larger levels. This result is discussed in the context of model selection of graphical models.

#### Model Choice: A minimum posterior predictive loss approach

Authors: Sujit. Ghosh, Department of Statistics, NC State University; Alan E. GelfandDepartment of Statistics, University of Connecticut

Abstract: Model choice is a fundamental activity in the analysis of data sets, an activity which has become increasingly more important as computational advances enable the fitting of increasingly complex models. Such complexity typically arises through hierarchical structure which requires specification at each stage of probabilistic mechanisms, mean and dispersion forms, explanatory variables, etc. Nonnested hierarchical models introducing random effects may not be handled by classical methods. Bayesian approaches using predictive distributions can be used though the FORMAL solution, which includes Bayes factors as a special case, can be criticized. It seems natural to evaluate model performance by comparing what it predicts with what has been observed. Most classical criteria utilize such comparison. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data but tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, for models under consideration, selecting the one which minimizes this criterion. For a version of log scoring loss we can do the minimization explicitly, obtaining an expression which can be interpreted as a penalized deviance criterion. We illustrate its performance with an application to a large data set involving residential property transactions.

Availability: PDF

#### Mean Field Inference in a General Probabilistic Setting

Authors: Michael Haft, Reimar Hofmann, and Volker Tresp, Siemens AG, Corporate Technology, Information and Communications Department

Abstract: We present a systematic, model-independent formulation of  mean field theory (MFT) as an inference method in probabilistic models. “Model-independent” means that we do not assume a particular type of dependency among the variables of a domain but instead work in a general probabilistic setting. In a Bayesian network, for example, you may use arbitrary tables to specify conditional dependencies and thus run MFT in any Bayesian network. Furthermore, the general mean field equations derived here shed a light on the essence of MFT. MFT can be interpreted as a local iteration scheme which relaxes in a consistent state (a solution of the mean field equations). Iterating the mean field equations means propagating information through the network. In general, however, there are multiple solutions to the mean field equations. We show that improved approximations can  be obtained by forming a weighted mixture of the multiple mean field solutions.  Simple approximate expressions for the mixture weights are given. The benefits of taking into account multiple solutions are demonstrated by using MFT for inference in a small Bayesian network representing a medical domain. Thereby it turns out that every solution of the mean field equations can be interpreted as a ‘disease scenario’.

Availability: PDF

#### Probabilistic kernel regression models

Authors: Tommi S. Jaakkola, Department of Computer Science and Electrical Engineering, Massachusetts Institute of Technology; David Haussler, Department of Computer Science, University of California–Santa Cruz

Abstract: We introduce a class of flexible conditional probability models and techniques for classification/regression problems. Many existing methods such as generalized linear models and support vector machines are subsumed under this class. The flexibility of this class of techniques comes from the use of kernel functions as in support vector machines, and the generality from dual formulations of standard regression models.

*The work was done while T. Jaakkola was at UC Santa Cruz.

#### Hierarchical Mixtures-of-Experts for Generalized Linear Models: Some Results on Denseness and Consistency

Authors: Wenxin Jiang and Martin A. Tanner, Department of Statistics, Northwestern University

Abstract: We investigate a class of hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form $\psi(a+x^T b)$ are mixed. Here $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model with mean function belonging to a class of smooth functions of the form $\psi(h(x))$ where $h \in W_{2;K_0}^\infty$ (a Sobolev class over $[0,1]^{s}$). It is shown that the HME mean functions can approximate the true mean function, at a rate of $O(m^{-2/s})$ in $L_p$ norm. Moreover, the HME probability density functions can approximate the true density, at a rate of $O(m^{-2/s})$ in Hellinger distance, and at a rate of $O(m^{-4/s})$ in Kullback-Leibler divergence. These rates can be achieved within the family of HME structures with a tree of binary splits, or within the family of structures with a single layer of experts. Here $s$ is the dimension of the predictor $x$. It is also shown that likelihood-based inference based on HME is consistent in recovering the truth, in the sense that as the sample size $n$ and the number of experts $m$ both increase, the mean square error of the estimated mean response goes to zero. Conditions for such results to hold are stated and discussed.

Availability: PDF

#### Stochastic local search for Bayesian networks

Authors: Kalev Kask and Rina Dechter, Department of Information and Computer Science, University of California–Irvine

Abstract: The paper evaluates empirically the suitability of Stochastic Local Search algorithms (SLS) for finding most probable explanations in Bayesian networks. SLS algorithms (e.g. GSAT, WSAT) have recently proven to be highly effective in solving complex constraint-satisfaction and satisfiability problems which cannot be solved by traditional search schemes. Our experiments investigate the applicability of this scheme to probabilistic optimization problems.Specifically, we show that algorithms combining hill-climbing steps with stochastic steps (guided by the network’s probability distribution) called G+StS, outperform pure hill-climbing search, pure stochastic simulation search, as well as simulated annealing. In addition, variants of G+StS that are augmented on top of alternative approximation methods are shown to be particularly effective.

Availability: PDF

#### Bayesian Graphical Models, Intention-to-Treat, and the Rubin Causal Model

Author: David Madigan, Department of Statistics, University of Washington

Abstract: In clinical trials with significant noncompliance the standard intention-to-treat analyses sometimes mislead. Rubin’s causal model provides an alternative method of analysis that can shed extra light on clinical trial data. Formulating the Rubin Causal Model as a Bayesian graphical model facilitates model communication and computation.

Availability: PDF

#### Efficient Mining of Statistical Dependencies

Authors: Tim Oates, Matthew D. Schmill, Paul R. Cohen, and Casey Durfee, Experimental Knowledge Systems Lab, Department of Computer Science, University of Massachusetts–Amherst

Abstract: The Multi-Stream Dependency Detection algorithm finds rules that capture statistical dependencies between patterns in multivariate time series of categorical data. Rule strength is measured by the G statistic, and an upper bound on the value of G for the descendants of a node allows MSDD’s search space to be pruned. However, in the worst case, the algorithm will explore exponentially many rules. This paper presents and empirically evaluates two ways of addressing this problem. The first is a set of three methods for reducing the size of MSDD’s search space based on information collected during the search process. Second, we discuss an implementation of MSDD that distributes its computations over multiple machines on a network.

#### Tractable structure search in the presence of latent variables

Authors: Thomas Richardson, Heiko Bailer, and Moulinath Banerjees, Department of Statistics, University of Washington

Abstract: The problem of learning the structure of a DAG model in the presence of latent variables presents many formidable challenges. In particular there are an infinite number of latent variable models to consider, and these models possess features which make them hard to work with. We describe a class of graphical models which can represent the conditional independence structure induced by a latent variable model over the observed margin. We give a parametrization of the set of Gaussian distributions with conditional independence structure given by a MAG model. The models are illustrated via a simple example. Different estimation techniques are discussed in the context of Zellner’s Seemingly Unrelated Regression (SUR) models.

#### Boosting Methodology for Regression Problems

Authors: Greg Ridgeway, David Madigan, and Thomas RichardsonDepartment of Statistics, University of Washington

Abstract: Classification problems have dominated research on boosting to date. The application of boosting to regression problems, on the other hand, has received little investigation. In this paper we develop a new boosting method for regression problems. We cast the regression problem as a classification problem and apply an interpretable form of the boosted naïve Bayes classifier. This induces a regression model that we show to be expressible as an additive model for which we derive estimators and discuss computational issues. We compare the performance of our boosted naïve Bayes regression model with other interpretable multivariate regression procedures.

Availability: PDF

#### An Experiment in Causal Inference Using a Pneumonia Database

Authors: Peter Spirtes, Department of Philosophy, Carnegie Mellon University; Greg Cooper, Center for Biomedical Informatics, University of Pittsburgh

Abstract: We tested a causal discovery algorithm on a database of pneumonia patients. The output of the causal discovery algorithm is a list of statements “A causes B”, where A and B are variables in the database, and a score indicating the degree of confidence in the statement. We compared the output of the algorithm with the opinions of physicians about whether A caused B or not. We found that the doctors opinions were independent of the output of the algorithm. However, an examination of the output of results suggested a simple, well motivated modification of the algorithm which would bring the output of the algorithm into high agreement with the physicians opinions.

Availability: PDF

#### A Note on the Comparison of Polynomial Selection Methods

Authors: Murlikrishna Viswanathan and Professor Chris Wallace, School of Computer Science and Software Engineering, Monash University–Clayton

Abstract: Minimum Message Length (MML) and Structural Risk Minimisation (SRM) are two computational learning principles that have achieved wide acclaim in recent years. Whereas the former is based on Bayesian learning and the latter on the classical theory of VC-dimension, they are similar in their attempt to define a trade-off between model complexity and goodness of fit to the data. A recent empirical study by Wallace compared the performance of standard model selection methods in a one-dimensional polynomial regression framework. The results from this study provided strong evidence in support of the MML and SRM based methods over the other standard approaches. In this paper we present a detailed empirical evaluation of three model selection methods which include an MML based approach and two SRM based methods. Results from our analysis and experimental evaluation suggest that the MML-based approach in general has higher predictive accuracy and also raise questions on the inductive capabilities of the Structural Risk Minimization Principle.

## Poster Sessions | January 4

#### Transfer of Information between System and Evidence Models

Authors: Russell Almond, Research Statistics Group at Educational Testing Service; Edward Herskovits, Noetic Systems, Inc.; Robert J. Mislevy, Model Based Measurement Group at Educational Testing Service; Linda Stienberg, Educational Policy Research at Educational Testing Service

Abstract: In this paper we illustrate a simple scheme for dividing a complex Bayes network into a system model and a collection of smaller evidence models. While the system model maintains a permanent record of the state of the system of interest, the evidence models are only used momentarily to absorb evidence from specific observations or findings and then discarded. This paper describes an implementation of a system model—evidence model complex in which each system and evidence model has a separate Bayes net and Markov tree representation. As necessary, information is propagated between common Markov tree nodes of the evidence and system models. While mathematically equivalent to the full Bayes network, the system model–evidence model complex allows us to (a) separate the seldom used evidence model portions from the core system model thus reducing search and propagation time in the network and (b) easily replace the evidence models (this is particular advantageous in educational examples in which new test items are often introduced to prevent overexposure of assessment tasks).

#### A Bayesian Model for Collaborative Filtering

Authors: Yung-Hsin Chien and Edward I. George, Department of MSIS at University of Texas at Austin

Abstract: Consider the general setup where a set of items have been partially rated by a set of judges, in the sense that not every item has been rated by every judge. For this setup, we propose a Bayesian approach for the problem of predicting the missing ratings from the observed ratings. This approach incorporates similarity by assuming the set of judges can be partitioned into groups which share the same ratings probability distribution. This leads to a predictive distribution of missing ratings based on the posterior distribution of the groupings and associated ratings probabilities. Markov chain Monte Carlo methods and a hybrid search algorithm are then used to obtain predictions of the missing ratings.

#### Parameter learning from incomplete data for Bayesian networks

Author: RG Cowell

Abstract: In a companion paper (Cowell 1999), I described a method of using maximum entropy to estimate the joint probability distribution for a set of discrete variables from missing data. Here I extend the method of that paper to incorporate prior information for application to parameter learning in Bayesian networks.

#### On the Application of The Bootstrap for Computing Confidence Measures on Features of Induced Bayesian Networks

Authors: Nir FriedmanInstitute of Computer Science at Hebrew University; Moises Goldszmidt, SRI International; Abraham Wyner, Department of Statistics at Wharton School

Abstract: In the context of learning Bayesian networks from data, very little work has been published on methods for assessing the quality of an induced model. This issue, however, has received a great deal of attention in the statistics literature. In this paper, we take a well-known method from statistics, Efron’s Bootstrap, and examine its applicability for assessing a confidence measure on features of the learned network structure. We also compare this method to assessments based on a practical realization of the Bayesian methodology.

#### Relaxing the Local Independence Assumption for Quantitative Learning in Acyclic Directed Graphical Models through Hierarchical Partition Models

Authors: Daniela Golinelli and David Madigan, Department of Statistics at University of Washington and Guido Consonni, Dip. di Economia e Metodi Quantitativi at Universita’ di Pavia

Abstract: The simplest method proposed by Spiegelhalter and Lauritzen (1990) to perform quantitative learning in ADG presents a potential weakness: the local independence assumption. We propose to alleviate this problem through the use of Hierarchical Partition Models. Our approach is compared with the previous one from an interpretative and predictive point of view.

#### The exploration of new methods for learning in binary Boltzmann machines

Authors: Keith Humphreys and D.M. Titterington, Department of Statistics at University of Glasgow

Abstract: Exact inference for Boltzmann machines is computationally expensive. One approach to improving tractability is to approximate the gradient algorithm. We describe a new way of doing this which is based on Bahadur’s representation of the multivariate binary distribution (Bahadur, 1961). We compare the approach, for networks with no unobserved variable, to the “mean field” approximation of Peterson and Anderson (1987) and the approach of Kappen and Rodriguez (1998), which is based on the linear response theorem. We also investigate the use of the pairwise association cluster method (Tanaka and Morita, 1995).

#### Statistical Challenges to inductive inference in linked data

Author: David Jensen

Abstract: Many data sets can be represented naturally as collections of linked objects. For example, document collections can be represented as documents (nodes) connected by citations and hypertext references (links). Similarly, organizations can be represented as people (nodes) connected reporting relationships, social relationships, and communication patterns (links).

#### Mixture Model Clustering with the Multimix Program

Authors: Murray A. Jorgensen and Lynette A. Hunt, Department of Statistics at University of Waikato

Abstract: Hunt (1996) has implemented the finite mixture model approach to clustering in a program called Multimix. The program is designed to cluster multivariate data with categorical and continuous variables and possibly containing missing values. The model fitted simultaneously generalises the Latent Class model and the mixture of multivariate normals model. Like either of these models Multimix can be used to form clusters by the Bayes allocation rule. This is the intended use of the program, although the parameter estimates can be used to give a succinct description of the clusters.

Use of the EM algorithm, with its view of the observed data as being notionally augmented by missing information to form the ‘complete data’, gives a broad framework for estimation which is able to handle two types of missing information: unknown cluster assignment and missing data. Using the methodology of Little and Rubin (1987). in this way Multimix is able to handle missing data in a less ad hoc way than many clustering algorithms. The program runs in acceptable time with large data matrices (say hundreds of observations on tens of variables). Use of the missing-data facility increases execution time somewhat. In this presentation we describe the approach taken to the design of Multimix and how some of the statistical problems were dealt with. As examples of the use of the program we cluster a large medical dataset and a version of Fisher’s Iris data in which a third of the values are randomly made ‘missing’.

#### Learning Augmented Bayesian Classifiers: A Comparison of Distribution-based and Classification-based Approaches

Authors: Eamonn Keogh and Michael J. Pazzani, Information and Computer Science at University of California, Irvine

Abstract: The nave Bayes classifier is built on the assumption of conditional independence between the attributes given the class. The algorithm has been shown to be surprisingly robust to obvious violations of this condition, but it is natural to ask if it is possible to further improve the accuracy by relaxing this assumption. We examine an approach where nave Bayes is augmented by the addition of correlation arcs between attributes. We explore two methods for finding the set of augmenting arcs, a greedy hill-climbing search, and a novel, more computationally efficient algorithm that we call SuperParent. We compare these methods to TAN; a state-of the-art distribution-based approach to finding the augmenting arcs.

#### Exploring the robustness of Bayesian and information-theoretic methods for predictive inference

Authors: Petri Kontkanen, Petri Myllymäki, Tomi Silander, Henry Tirri, and Kimmo Valtonen

Abstract: Given a set of sample data, we study three alternative methods for determining the predictive distribution of an unseen data vector. In particular, we are interested in the behavior of the predictive accuracy of these three predictive methods as a function of the degree of the domain assumption violations. We explore this question empirically by using artificially generated data sets, where the assumptions can be violated in various ways. Our empirical results suggest that if the model assumptions are only mildly violated, marginalization over the model parameters may not be necessary in practice. This is due to the fact that in this case the computationally much simpler predictive distribution based on a single, maximum posterior probability model shows similar performance as the computationally more demanding marginal likelihood approach. The results also give support to Rissanen’s theoretical results about the usefulness of using Jereys’ prior distribution for the model parameters.

Availability: PDF

#### Structure optimization of density estimation models applied to regression problems with dynamic noise

Authors: Martin Kreutz, Bernhard Sendhoff, and Werner von Seelen, Institut für Neuroinformatik at Ruhr-Universität Bochum; Anja M. Reimetz and Claus Weihs, Fachbereich Statistik at Universität Dortmund

Abstract: In this paper we deal with the problem of model selection for time series forecasting with dynamical noise and missing data. We employ an evolutionary algorithm to the optimization of a mixture of densities model in order to estimate, via a log-likelihood based quality measure, the joint probability density of the data. We apply our method to the prediction of both artificial time series, generated from the Mackey-Glass equation, and time series from a real world system consisting of physiological data of apnea patients.

Related work from the authors:

• Martin Kreutz, Anja M. Reimetz, Bernhard Sendhoff, Claus Weihs and Werner von Seelen. Optimisation of Density Estimation Models with Evolutionary Algorithms. In A.E. Eiben, Th. Bäck, M. Schoenauer and H.P. Schwefel, editors, Parallel Problem Solving from Nature – PPSN V, pages 998-1007, Lecture Notes in Computer Science 1498, Springer, 1998.

#### A learning rule based method of feature extraction with application to acoustic signal classification

Author: M.J. Larkin, The Institute for Brain and Neural Systems at Brown University

Abstract: We apply the Bienenstock, Cooper, and Munro (1982) theory of visual cortical plasticity to the problem of extracting features (i.e., reduction of dimensionality) from acoustic signals; in this case, labeled samples of marine mammal sounds. We first implemented BCM learning in a single neuron model, trained the neuron on samples of acoustic data, and then observed the response when the neuron was tested on different classes of acoustic signals. Next, a multiple neuron network was constructed, with lateral inhibition among the neurons. By training neurons to be selective to inherent features in these signals, we are able to develop networks which can then be used in the design of an automated acoustic signal classifier.

#### Learning Extensible Multi-Entity Directed Graphical Models

Author: Kathryn Blackmond Laskey, Department of Systems Engineering and Operations Research at George Mason University

Abstract: Graphical models have become a standard tool for representing complex probability models in statistics and artificial intelligence. In problems arising in artificial intelligence, it is useful to use the belief network formalism to represent uncertain relationships among variables in the domain, but it may not be possible to use a single, fixed belief network to encompass all problem instances. This is because the number of entities to be reasoned about and their relationships to each other varies from problem instance to problem instance. This paper describes a framework for representing probabilistic knowledge as fragments of belief networks and an approach to learning both structure and parameters from observations.

#### A latent variable model for multivariate discretization

Authors: Stefano Monti, Intelligent Systems Program at University of Pittsburgh and Gregory F. Cooper, Center for Biomedical Informatics at University of Pittsburgh

Abstract: We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only. Building upon existing class-based discretization methods, we use a latent variable as a proxy class variable, which is then utilized to drive the partition of the value range of each continuous variable. We present experimental results on simulated data aimed at assessing the merits of the proposed method.

#### Testing Regression Models With Fewer Regressors

Authors: Judea Pearl and Peyman Meshkat

Abstract: A BASIS for a model M is a minimal set of tests that, if satisfied, implies the satisfaction of all the assumptions behind M. This paper proposes a graphical procedure of recognizing bases of regression models. Using this precedure, it is possible to select a set of tests in which the number of regressors is small, compared with standard tests, thus resulting in improved power.

#### Learning Conditional Probabilities from Incomplete Databases - An Experimental Comparison

Authors: Marco Ramoni and Paola Sebastiani, The Open University

Abstract: This paper compares three methods – the EM algorithm, Gibbs sampling, and Bound and Collapse (BC) – to estimate conditional probabilities from incomplete databases in a controlled experiment. Results show a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using BC.

Other information: Further information is available from the home page of the Bayesian Knowledge Discovery project at The Open University.

#### Local Experts Combination through Density Decomposition

Authors: Ahmed Rida, Abderrahim Labbi, and Christian Pellegrini, Artificial Intelligence group at Centre Universitaire d’Informatique

Abstract: In this paper we describe a divide-and-combine strategy for decomposition of a complex prediction problem into simpler local sub-problems. We firstly show how to perform a soft decomposition via clustering of input data. Such decomposition leads to a partition of the input space into several regions which may overlap. Therefore, to each region is assigned a local predictor (or expert) which is trained only on local data. To construct a solution to the global prediction problem, we combine the local experts using two approaches: weighted averaging where the outputs of local experts are weighted by their prior densities, and nonlinear adaptive combination where the pooling parameters are obtained through minimization of a global error. To illustrate the validity of our approach, we show simulation results for two classification tasks, vowels and phonemes, using local experts which are Multi-Layer Perceptrons (MLP) and Support Vector Machines (SVM). We compare the results obtained using the two local combination modes with the results obtained using a global predictor and a linear combination of global predictors.

#### Entropy-Driven Inference and Inconsistency

Authors: Wilhelm Rödder and Longgui Xu, FernUniversität Gesamthochschule in HagenFachbereich Wirschaftswissenschaft, Lehrstuhl für Betriebswirtschaftslehre, insb. Operations Research

Abstract: Probability distributions on a set of discrete variables are a suitable means to represent knowledge about their respective mutual dependencies. When now things become evident such a distribution can be adapted to the new situation and hence submitted to a sound inference process. Knowledge acquisition and inference are here performed in the rich syntax of conditional events. Both, acquisition and inference respect a sophisticated principle, namely that of maximum entropy and of minimum relative entropy. The freedom to formulate and derive knowledge in a language of rich syntax is comfortable but involves the danger of contradictions or inconsistencies. We develop a method how to solve such inconsistencies which go back to the incompatibility of experts’ knowledge in their respective branches. The method is applied to the diagnosis in Chinese medicine. All calculations are performed in the Entropy-driven expert system shell SPIRIT.

Availability: PDF

References:

[1] I. Csiszár: I-Divergence Geometry of Probability Distributions and Minimisation Problems. The Annals of Probability 3, (1): 146 – 158 (1975).

[2] I. Csiszár: Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems. The Annals of Statistics 19 (4): 2032 – 2066 (1991).

[3] G. Kern-Isberner: Characterising the principle of minimum cross-entropy within a conditional-logical framework. Artificial Intelligence 98: 169-208 (1998).

[4] S. L. Lauritzen: Graphical Association Models (Draft), Technical Report IR 93-2001, Institute for Electronic Systems, Dept. of Mathematics and Computer Science, Aalborg University (1993).

[5] W. Rödder and G. Kern-Isberner: Representation and extraction of information by probabilistic logic. Information Systems 21 (8): 637 – 652 (1996).

[6] W. Rödder and C.-H. Meyer: Coherent knowledge processing at maximum entropy by SPIRIT, Proceedings 12th Conference on Uncertainty in Artificial Intelligence, E. Horitz and F. Jensen (editors), Morgan Kaufmann, San Francisco, California: 470 – 476 (1996).

[7] C. C. Schnorrenberger: Lehrbuch der chinesischen Medizin für westliche Ärzte, Hippokrates, Stuttgart (1985).

[8] C. E. Shannon: A mathematical theory of communication, Bell System Tech. J. 27, 379-423 (part I), 623 – 656 (part II) (1948).

[9] J. E. Shore and R. W. Johnson: Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross Entropy. IEEE Trans. Information Theory 26 (1): 26 – 37 (1980).

[10] J. Whittaker: Graphical Models in Applied Mathematical Multivariate Statistics, John Wiley & Sons (1990).

#### Learned Models for Continuous Planning

Authors: Matthew D. Schmill, Tim Oates, and Paul R. Cohen, Experimental Knowledge Systems Laboratory at University of Massachusetts

Abstract: We are interested in the nature of activity — structured behavior of nontrivial duration — in intelligent agents. We believe that the development of activity is a continual process in which simpler activities are composed, via planning, to form more sophisticated ones in a hierarchical fashion. The success or failure of a planner depends on its models of the environment, and its ability to implement its plans in the world. We describe an approach to generating dynamical models of activity from real-world experiences and explain how they can be applied towards planning in a continuous state space.

#### Efficient Optimization of Large k Real-time Control Algorithm

Authors: Delphi Delco Electronics Systems, Restraint Systems Electronics and Daniel H. Loughlin, North Carolina State University

Abstract: Resource requirements for global optimization increase dramatically with the number of real-valued decision variables (k). Efficient search strategies are needed to satisfy constraints of time, effort, and funding. In this paper, a conjunction of several disparate methods is used to automatically calibrate a non-linear real-time control used in the automotive industry. By combining a response surface methodology with a hybrid genetic algorithm search, air-bag deployment calibrations can be automated, producing solutions superior to conventional manual search.

#### Model Folding for Data Subject to Nonresponse

Authors: Paola Sebastiani, The Open University and Marco Ramoni, Knowledge Media Institute at The Open University

Abstract: In this paper we introduce a deterministic method to estimate the posterior probability of rival models from data with partially ignorable nonresponse.  The accuracy of the method will be shown via an application to synthetic data.

Other information: Futher information is available from the home page of the Bayesian Knowledge Discovery project at The Open University.

#### Geometry, Moments and Bayesian Networks with Hidden Variables

Authors: Raffaella Settimi, University of Warwick and Jim Q. Smith, University of Warwick

Abstract: The purpose of this paper is to present a systematic way of analysing the geometry of the probability spaces for a particular class of Bayesian networks with hidden variables. It will be shown that the conditional independence statements implicit in such graphical models can be neatly expressed as simple polynomial relationships among central moments. This algebraic framework  will enable us to explore and identify the structural constraints on the sample space induced by  models with tree structures and therefore characterise the families of distributions consistent with such conditional independence assumptions.

#### Joint probabilistic clustering of multivariate and sequential data

Author: Padhraic Smyth, University of California–Irvine

Consider the following problem. We have a set of individuals (a random sample from a larger population) whom we would like to cluster into groups based on observational data. For each individual we can measure characteristics which are relatively static (eg, their height, weight, income, age, sex, etc). Probabilistic model-based clustering in this contextusually takes the form of a nite mixture model, where each component in the mixture is a multivariate probability density function (or distribution function) for a particular group. This approach has been found to be a useful general technique for extracting hidden structure from multivariate data (Ban eld and Raftery, 1993; Thiesson et al, 1997).

#### Analysis of multivariate time series via a hidden graphical model

Authors: Elena Stanghellini, Universita’ di Perugia and Joe Whittaker, Lancaster University

Abstract: We propose a chain graph with unobserved variables to model a multivariate time series. We assume that an underlying common trend linearly affects the observed time series, but we do not restrict our analysis to models where the underlying factor accounts for all the contemporary correlations of the series. The residual correlation is modelled using results of graphical models. Modelling the associations left unexplained is an alternative to augmenting the dimension of the underlying factor. It is justified when a clear interpretation of the residual associations is available. It is also an informative way to explore sources of deviation from standard dynamic single-factor models.

#### Visual design support for probabilistic network application

Author: Axel Vogler, Daimler Benz AG

Abstract: Understanding inference in probabilistic networks is an important point in the design phase. Their causal structure and locally defined parameters are intuitive to human experts. The global system induced by the local parameters can lead to results not intended by the human experts. To support network design an edge coloring scheme explaining influences between variables responsible for inference result is introduced.