Home
Uncertainty 99 was the Seventh International Workshop on Artificial Intelligence and Statistics and was presented by The Society for Artificial Intelligence & Statistics.
Program Committee
 Russell Almond, Educational Testing Service
 Chris Bishop, Microsoft Research
 Wray Buntine, Ultimode Systems
 Peter Cheeseman, NASA Ames
 Max Chickering, Microsoft Research
 Paul Cohen, University of Massachusetts
 Greg Cooper, University of Pittsburgh
 Philip Dawid, University College London
 David Dowe, Monash University
 William DuMouchel, AT&T Labs
 Sue Dumais, Microsoft Research
 David Edwards, Novo Nordisk
 Doug Fisher, Vanderbilt University
 Nir Friedman, Hebrew University–Jerusalem
 Dan Geiger, Technion
 Edward George, University of Texas
 Clark Glymour, CarnegieMellon University
 Moises Goldszmidt, SRI International
 David Hand, Open University
 Geoff Hinton, University of Toronto
 Tommi Jaakkola, MIT
 Michael Jordan, UC Berkeley
 Michael Kearns, AT&T Labs
 Daphne Koller, Stanford University
 Steffen Lauritzen, Aalborg University
 Hans Lenz, Free University of Berlin
 David Lewis, AT&T Labs
 David Madigan, University of Washington
 Andrew Moore, CarnegieMellon University
 Daryl Pregibon, AT&T Labs
 Thomas Richardson, University of Washington
 Alberto Roverato, Universita di Modena
 Lawrence Saul, AT&T Labs
 Ross Shachter, Stanford University
 Richard Scheines, CarnegieMellon University
 Sebastian Seung, MIT
 Prakash Shenoy, University of Kansas
 Padhraic Smyth, UC Irvine
 David Spiegelhalter, MRC–Cambridge
 Peter Spirtes, CarnegieMellon University
 Milan Studeny, Academy of Sciences of Czech Republic
 Nanny Wermuth, Mainz University
Program
Monday, January 4
Time  Session  Speaker  

7:30–8:45

Registration/Continental Breakfast 

8:45–9:00 
Opening Comments 
David Heckerman and Joe Whittaker  
9:00–11:00

Session I: Model Choice 

Processoriented evaluation: The next step  Pedro Domingos  
Model choice  Alan Gelfand and Sujit Ghosh  
A note on the comparison of polynomial selection methods  Murlikrishna Viswanathan  
Pattern discovery via entropy minimization  Matthew Brand  
11:00–11:30  Break  
11:30–12:30  Session II: Latent variables Chair: Kathyrn Laskey 

On the geometry of DAG models with hidden variables  Dan Geiger, David Heckerman, Henry King, Chris Meek  
Efficient structure search in the presence of latent variables  Thomas Richardson, Heiko Bailer, Mooulinath Bannerjee  
12:30–1:30  Lunch (provided)  
1:30–5:00  Break  
5:00–6:00  Poster Summaries (2 mins/poster) Chair: Joe Whittaker 

6:00–7:00  Dinner  
7:00–9:30  Poster Sessions 
Tuesday, January 5
Time  Session  Speaker  

8:00–9:00

Continental Breakfast 

9:00–10:00 
Session III: Theory 

Conditional products: an alternative approach to conditional independence 
Phil Dawid, Milan Studeny  
Hierarchical mixturesofexperts for generalized linear models: some results on denseness and consistency  Wenxin Jiang, Martin Tanner  
10:00–10:30  Coffee Break  
10:30–11:30  Session IV: Regression Chair: Padhraic Smyth 

Boosting methodology for regression problems  Greg Ridgeway, David Madigan, Thomas Richardson  
Probabilistic kernel regression models  Tommi Jaakkola, David Haussler  
11:30–12:30  Session V: Computational Methods Chair: Padhraic Smyth 

Learning structure from data efficiently: applying bounding techniques  Nir Friedman, Lise Getoor  
Efficient mining of statistical dependencies  Tim Oates, Paul Cohen, Casey Durfee  
12:30–2:00  Lunch (provided)  
2:00–3:30  Session VI: Applications  
Causal mechanisms and classification trees for predicting chemical carcinogens  Louis A Cox  
Geometric modelling of a nuclear environment  Jan De Geeter  
Modeling decision tree performance with the power law  Lewis Frey, Doug Fisher  
3:30–4:40  Business Meeting 
Wednesday, January 6
Time  Session  Speaker  

8:00–9:00

Continental Breakfast 

9:00–10:30 
Session VII: Inference 

Modelindependent mean field theory as a local method for approximate propagation of information 
Michael Haft, Reimar Hofmann, Volker Tresp  
Hierarchical IFA belief networks  Hagai Attias  
Stochastic local search for Bayesian network  Kalev Kask, Rina Dechter  
10:30–11:00  Coffee Break  
11:00–12:00  Session VIII: Applications Chair: Doug Fisher 

An experiment in causal discovery using a pneumona database  Peter Spirtes, Greg Cooper  
Bayesian graphical models for noncompliance in randomaized trials  David Madigan  
12:00–12:15  Closing Remarks 
Plenary Presentations lasted 25 minutes with 5 minutes for questions.
Tutorials & Abstracts
Tutorials  January 3
Information Access and Retrieval
Speaker: Susan Dumais, Microsoft Research  8:30 AM – 10:30 AM
The Web has made literally terabytes of information available at the click of a mouse. The challenge is in finding the right information. Information retrieval is concerned with providing access to textual data for which we have no good formal model, such as a relational model. Statistical approaches have been widely applied to this problem. This tutorial will provide an overview of: a) statistical characteristics of large text collections (e.g., size, sparcity, word distributions), b) important retrieval models (e.g., Boolean, vector space and probabilistic), and c) enhancements which use unsupervised learning to model structure in text collections, or supervised learning to incorporate user feedback. We will conclude with a discussion of open research issues where improved statistical models can improve performance.
Bayesian Statistical Analysis
Speaker: David Spiegelhalter, MRC Biostatistics Unit, Institute for Public Health, Cambridge  11:00 AM – 12:00 AM and 1:30 PM – 2:30 PM
The first part of the tutorial will cover the fundamentals of Bayesian inference, including probability and its subjective interpretation, evaluation of probability assessments using scoring rules, utilities and decision theory. The use of Bayes theorem for updating beliefs will be illustrated for both binomial and normal likelihoods, and the use of conjugate families of priors and predictive distributions described. The First Bayes software will be used to display conjugate Bayesian analysis. The second part will introduce the concept of `exchangeability’, and the consequent use of hierarchical models in which the unknown parameters of a common prior are included in the model. Conditional independence assumptions lead naturally to a graphical representation of hierarchical models. Markov chain Monte Carlo (MCMC) methods will be introduced as a means of carrying out the necessary numerical integrations, and topics covered will include the relationship of Gibbs sampling to graphical modelling, parameterisation, initial values, and choice of prior distributions. Real examples will be used throughout, and online analysis of an example in longitudinal modelling with measurement error on predictors will be carried out using the WinBUGS program.
Additive Logistic Regression: A Statistical View of Boosting
Speaker: Trevor Hastie, Stanford University  3:30 PM – 5:00 PM
Boosting (Freund and Schapire, 1995) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the twoclass problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Directmulticlass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on bestfirst truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large scale data mining applications.
* joint work with Jerome Friedman and Rob Tibshirani
Abstracts  January 4 – 6
Hierarchical IFA Belief Networks
Author: Hagai Attias, Gatsby Computational Neuroscience Unit, University College London
Abstract: We introduce a new realvalued belief network, which is a multilayer generalization of independent factor analysis (IFA). At each level, this network extracts realvalued latent variables that are nonlinear functions of the input data with a highly adaptive functional form, resulting in a hierarchical distributed representation of these data. The network is based on a probabilistic generative model, constructed by cascading singlelayer IFA models. Whereas exact maximumlikelihood learning for this model is intractable, we present and demonstrate an algorithm that maximizes a lower bound on the likelihood. This algorithm is developed by formulating a variational approach to hierarchical IFA networks.
Availability: PDF
Pattern discovery via entropy minimization
Author: Matthew Brand, Mitsubishi Electric Research Labs
Abstract: We propose a framework for learning hiddenvariable models by optimizing entropies, in which entropy minimization, posterior maximization, and free energy minimization are all equivalent. Solutions for the maximum a posteriori (MAP) estimator yield powerful learning algorithms that combine all the charms of expectationmaximization and deterministic annealing. Contained as special cases are the methods of maximum entropy, maximum likelihood, and a new method, maximum structure. We focus on the maximum structure case, in which entropy minimization maximizes the amount of evidence supporting each parameter while minimizing uncertainty in the sufficient statistics and crossentropy between the model and the data. In iterative estimation, the MAP estimator gradually extinguishes excess parameters, sculpting a model structure that reflects hidden structures in the data. These models are highly resistant to overfitting and have the particular virtue of being easy to interpret, often yielding insights into the hidden causes that generate the data.
Availability: PDF
Causal Mechanisms and Classification Trees for Predicting Chemical Carcinogens
Author: Louis Anthony (“Tony”) Cox, Jr., Cox Associates
Abstract: Classification trees, usually used as a nonlinear, nonparametric classification method, can also provide a powerful framework for comparing, assessing, and combining information from different expert systems, by treating their predictions as the independent variables in a classification tree analysis. This paper discusses the applied problem of classifying chemicals as human carcinogens. It shows how classification trees can be used to compare the information provided by ten different carcinogen classification expert systems, construct an improved “hybrid” classification system from them, and identify costeffective combinations of assays (the inputs to the expert systems) to use in classifying chemicals in future.
Availability: PDF
Conditional Products: An Alternative Approach to Conditional Independence
Authors: A. Philip Dawid, Department of Statistical Science, University College London; Milan Studeny, Institute of Information Theory and Automation, Academy of Sciences of Czech Republic, and Laboratory of Intelligent Systems, University of Economics Prague
Abstract: We introduce a new abstract approach to the study of conditional independence, founded on a concept analogous to the factorization properties of probabilistic independence, rather than the separation properties of a graph. The basic ingredient is the “conditional product”, which provides a way of combining the basic objects under consideration while preserving as much independence as possible. We introduce an appropriate axiom system for conditional product, and show how, when these axioms are obeyed, they induce a derived concept of conditional independence which obeys the usual semigraphoid axioms. The general structure is used to throw light on three specific areas: the familiar probabilistic framework (both the discrete and the general case); a settheoretic framework related to “variation independence”; and a variety of graphical frameworks.
Availability: PDF
Geometric Modeling of a Nuclear Environment
Authors: Jan De Geeter and Marc Decréton, SCK.CEN (Belgian Nuclear Research Centre); Joris De Schutter, Herman Bruyninckx, and Hendrik Van Brussel, Department of Mechanical Engineering, Division PMA, Katholieke Universiteit Leuven
Abstract: This paper is about the taskdirected updating of an incomplete and inaccurate geometric model of a nuclear environment, using only robust radiationresistant sensors installed on a robot that is remotely controlled by a human operator. In this problem, there are many sources of uncertainty and ambiguity. This paper proposes a probabilistic solution under Gaussian assumptions. Uncertainty is reduced with an estimator based on a Kalman filter. Ambiguity on the measurementfeature association is resolved by running a bank of those estimators in parallel, one for each plausible association. The residual errors of these estimators are used for hypothesis testing and for the calculation of a probability distribution over the remaining hypotheses. The best next sensing action is calculated as a Bayes decision with respect to a loss function that takes into account both the uncertainty on the current estimate, and the variance/precision required by the task.
ProcessOriented Evaluation: The Next Step
Author: Pedro Domingos, Artificial Intelligence Group, Instituto Superior Técnico
Abstract: Methods to avoid overfitting fall into two broad categories: dataoriented (using separate data for validation) and representationoriented (penalizing complexity in the model). Both have limitations that are hard to overcome. We argue that fully adequate model evaluation is only possible if the search process by which models are obtained is also taken into account. To this end, we recently proposed a method for processoriented evaluation (POE), and successfully applied it to rule induction (Domingos, 1998). However, for the sake of simplicity this treatment made two rather artificial assumptions. In this paper the assumptions are removed, and a simple formula for model evaluation is obtained. Empirical trials show the new, betterfounded form of POE to be as accurate as the previous one, while further reducing theory sizes.
Availability: PDF
Modeling Decision Tree Performance with the Power Law
Authors: Lewis J. Frey and Douglas H. Fisher, Jr., Computer Science Department, Vanderbilt University
Abstract: This paper discusses the use of a power law to predict decision tree performance. Power laws are fit to learning curves of decision trees trained on data sets from the UCI repository. The learning curves are generated by training C4.5 on different size training sets. The power law predicts diminishing returns in terms of error rate as training set size increase. By characterizing the learning curve with a power law, the error rate for a given size training set can be projected. This projection can be used in estimating the amount of data needed to achieve an acceptable error rate, and the cost effectiveness of further data collection.
Availability: PDF
Efficient Learning using Constrained Sufficient Statistics
Authors: Nir Friedman, Institute of Computer Science, The Hebrew University; Lise Getoor, Computer Science Department, Stanford University
Abstract: Learning Bayesian networks is a central problem for pattern recognition, density estimation and classification. In this paper, we propose a new method for speeding up the computational process of learning Bayesian network structure. This approach uses constraints imposed by the statistics already collected from the data to guide the learning algorithm. This allows us to reduce the number of statistics collected during learning and thus speed up the learning time. We show that our method is capable of learning structure from data more efficiently than traditional approaches. Our technique is of particular importance when the size of the datasets is large or when learning from incomplete data. The basic technique that we introduce is general and can be used to improve learning performance in many settings where sufficient statistics must be computed. In addition, our technique may be useful for alternate search strategies such as branch and bound algorithms.
Availability: PDF
On the geometry of DAG models with hidden variables
Authors: Dan Geiger, David Heckerman, and Christopher Meek, Decision Theory & Adaptive Systems, Microsoft Research; Henry King, Mathematics Department, University of Maryland
Abstract: We prove that many graphical models with hidden variables are not curved exponential families. This result, together with the fact that some graphical models are curved and not linear, implies that the hierarchy of graphical models, as linear, curved, and stratified, is noncollapsing; each level in the hierarchy is strictly contained in the larger levels. This result is discussed in the context of model selection of graphical models.
Model Choice: A minimum posterior predictive loss approach
Authors: Sujit. Ghosh, Department of Statistics, NC State University; Alan E. Gelfand, Department of Statistics, University of Connecticut
Abstract: Model choice is a fundamental activity in the analysis of data sets, an activity which has become increasingly more important as computational advances enable the fitting of increasingly complex models. Such complexity typically arises through hierarchical structure which requires specification at each stage of probabilistic mechanisms, mean and dispersion forms, explanatory variables, etc. Nonnested hierarchical models introducing random effects may not be handled by classical methods. Bayesian approaches using predictive distributions can be used though the FORMAL solution, which includes Bayes factors as a special case, can be criticized. It seems natural to evaluate model performance by comparing what it predicts with what has been observed. Most classical criteria utilize such comparison. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data but tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, for models under consideration, selecting the one which minimizes this criterion. For a version of log scoring loss we can do the minimization explicitly, obtaining an expression which can be interpreted as a penalized deviance criterion. We illustrate its performance with an application to a large data set involving residential property transactions.
Availability: PDF
Mean Field Inference in a General Probabilistic Setting
Authors: Michael Haft, Reimar Hofmann, and Volker Tresp, Siemens AG, Corporate Technology, Information and Communications Department
Abstract: We present a systematic, modelindependent formulation of mean field theory (MFT) as an inference method in probabilistic models. “Modelindependent” means that we do not assume a particular type of dependency among the variables of a domain but instead work in a general probabilistic setting. In a Bayesian network, for example, you may use arbitrary tables to specify conditional dependencies and thus run MFT in any Bayesian network. Furthermore, the general mean field equations derived here shed a light on the essence of MFT. MFT can be interpreted as a local iteration scheme which relaxes in a consistent state (a solution of the mean field equations). Iterating the mean field equations means propagating information through the network. In general, however, there are multiple solutions to the mean field equations. We show that improved approximations can be obtained by forming a weighted mixture of the multiple mean field solutions. Simple approximate expressions for the mixture weights are given. The benefits of taking into account multiple solutions are demonstrated by using MFT for inference in a small Bayesian network representing a medical domain. Thereby it turns out that every solution of the mean field equations can be interpreted as a ‘disease scenario’.
Availability: PDF
Probabilistic kernel regression models
Authors: Tommi S. Jaakkola, Department of Computer Science and Electrical Engineering, Massachusetts Institute of Technology; David Haussler, Department of Computer Science, University of California–Santa Cruz
Abstract: We introduce a class of flexible conditional probability models and techniques for classification/regression problems. Many existing methods such as generalized linear models and support vector machines are subsumed under this class. The flexibility of this class of techniques comes from the use of kernel functions as in support vector machines, and the generality from dual formulations of standard regression models.
*The work was done while T. Jaakkola was at UC Santa Cruz.
Hierarchical MixturesofExperts for Generalized Linear Models: Some Results on Denseness and Consistency
Authors: Wenxin Jiang and Martin A. Tanner, Department of Statistics, Northwestern University
Abstract: We investigate a class of hierarchical mixturesofexperts (HME) models where exponential family regression models with generalized linear mean functions of the form $\psi(a+x^T b)$ are mixed. Here $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model with mean function belonging to a class of smooth functions of the form $\psi(h(x))$ where $h \in W_{2;K_0}^\infty$ (a Sobolev class over $[0,1]^{s}$). It is shown that the HME mean functions can approximate the true mean function, at a rate of $O(m^{2/s})$ in $L_p$ norm. Moreover, the HME probability density functions can approximate the true density, at a rate of $O(m^{2/s})$ in Hellinger distance, and at a rate of $O(m^{4/s})$ in KullbackLeibler divergence. These rates can be achieved within the family of HME structures with a tree of binary splits, or within the family of structures with a single layer of experts. Here $s$ is the dimension of the predictor $x$. It is also shown that likelihoodbased inference based on HME is consistent in recovering the truth, in the sense that as the sample size $n$ and the number of experts $m$ both increase, the mean square error of the estimated mean response goes to zero. Conditions for such results to hold are stated and discussed.
Availability: PDF
Stochastic local search for Bayesian networks
Authors: Kalev Kask and Rina Dechter, Department of Information and Computer Science, University of California–Irvine
Abstract: The paper evaluates empirically the suitability of Stochastic Local Search algorithms (SLS) for finding most probable explanations in Bayesian networks. SLS algorithms (e.g. GSAT, WSAT) have recently proven to be highly effective in solving complex constraintsatisfaction and satisfiability problems which cannot be solved by traditional search schemes. Our experiments investigate the applicability of this scheme to probabilistic optimization problems.Specifically, we show that algorithms combining hillclimbing steps with stochastic steps (guided by the network’s probability distribution) called G+StS, outperform pure hillclimbing search, pure stochastic simulation search, as well as simulated annealing. In addition, variants of G+StS that are augmented on top of alternative approximation methods are shown to be particularly effective.
Availability: PDF
Bayesian Graphical Models, IntentiontoTreat, and the Rubin Causal Model
Author: David Madigan, Department of Statistics, University of Washington
Abstract: In clinical trials with significant noncompliance the standard intentiontotreat analyses sometimes mislead. Rubin’s causal model provides an alternative method of analysis that can shed extra light on clinical trial data. Formulating the Rubin Causal Model as a Bayesian graphical model facilitates model communication and computation.
Availability: PDF
Efficient Mining of Statistical Dependencies
Authors: Tim Oates, Matthew D. Schmill, Paul R. Cohen, and Casey Durfee, Experimental Knowledge Systems Lab, Department of Computer Science, University of Massachusetts–Amherst
Abstract: The MultiStream Dependency Detection algorithm finds rules that capture statistical dependencies between patterns in multivariate time series of categorical data. Rule strength is measured by the G statistic, and an upper bound on the value of G for the descendants of a node allows MSDD’s search space to be pruned. However, in the worst case, the algorithm will explore exponentially many rules. This paper presents and empirically evaluates two ways of addressing this problem. The first is a set of three methods for reducing the size of MSDD’s search space based on information collected during the search process. Second, we discuss an implementation of MSDD that distributes its computations over multiple machines on a network.
Tractable structure search in the presence of latent variables
Authors: Thomas Richardson, Heiko Bailer, and Moulinath Banerjees, Department of Statistics, University of Washington
Abstract: The problem of learning the structure of a DAG model in the presence of latent variables presents many formidable challenges. In particular there are an infinite number of latent variable models to consider, and these models possess features which make them hard to work with. We describe a class of graphical models which can represent the conditional independence structure induced by a latent variable model over the observed margin. We give a parametrization of the set of Gaussian distributions with conditional independence structure given by a MAG model. The models are illustrated via a simple example. Different estimation techniques are discussed in the context of Zellner’s Seemingly Unrelated Regression (SUR) models.
Boosting Methodology for Regression Problems
Authors: Greg Ridgeway, David Madigan, and Thomas Richardson, Department of Statistics, University of Washington
Abstract: Classification problems have dominated research on boosting to date. The application of boosting to regression problems, on the other hand, has received little investigation. In this paper we develop a new boosting method for regression problems. We cast the regression problem as a classification problem and apply an interpretable form of the boosted naïve Bayes classifier. This induces a regression model that we show to be expressible as an additive model for which we derive estimators and discuss computational issues. We compare the performance of our boosted naïve Bayes regression model with other interpretable multivariate regression procedures.
Availability: PDF
An Experiment in Causal Inference Using a Pneumonia Database
Authors: Peter Spirtes, Department of Philosophy, Carnegie Mellon University; Greg Cooper, Center for Biomedical Informatics, University of Pittsburgh
Abstract: We tested a causal discovery algorithm on a database of pneumonia patients. The output of the causal discovery algorithm is a list of statements “A causes B”, where A and B are variables in the database, and a score indicating the degree of confidence in the statement. We compared the output of the algorithm with the opinions of physicians about whether A caused B or not. We found that the doctors opinions were independent of the output of the algorithm. However, an examination of the output of results suggested a simple, well motivated modification of the algorithm which would bring the output of the algorithm into high agreement with the physicians opinions.
Availability: PDF
A Note on the Comparison of Polynomial Selection Methods
Authors: Murlikrishna Viswanathan and Professor Chris Wallace, School of Computer Science and Software Engineering, Monash University–Clayton
Abstract: Minimum Message Length (MML) and Structural Risk Minimisation (SRM) are two computational learning principles that have achieved wide acclaim in recent years. Whereas the former is based on Bayesian learning and the latter on the classical theory of VCdimension, they are similar in their attempt to define a tradeoff between model complexity and goodness of fit to the data. A recent empirical study by Wallace compared the performance of standard model selection methods in a onedimensional polynomial regression framework. The results from this study provided strong evidence in support of the MML and SRM based methods over the other standard approaches. In this paper we present a detailed empirical evaluation of three model selection methods which include an MML based approach and two SRM based methods. Results from our analysis and experimental evaluation suggest that the MMLbased approach in general has higher predictive accuracy and also raise questions on the inductive capabilities of the Structural Risk Minimization Principle.
Poster Sessions
Poster Sessions  January 4
Transfer of Information between System and Evidence Models
Authors: Russell Almond, Research Statistics Group at Educational Testing Service; Edward Herskovits, Noetic Systems, Inc.; Robert J. Mislevy, Model Based Measurement Group at Educational Testing Service; Linda Stienberg, Educational Policy Research at Educational Testing Service
Abstract: In this paper we illustrate a simple scheme for dividing a complex Bayes network into a system model and a collection of smaller evidence models. While the system model maintains a permanent record of the state of the system of interest, the evidence models are only used momentarily to absorb evidence from specific observations or findings and then discarded. This paper describes an implementation of a system model—evidence model complex in which each system and evidence model has a separate Bayes net and Markov tree representation. As necessary, information is propagated between common Markov tree nodes of the evidence and system models. While mathematically equivalent to the full Bayes network, the system model–evidence model complex allows us to (a) separate the seldom used evidence model portions from the core system model thus reducing search and propagation time in the network and (b) easily replace the evidence models (this is particular advantageous in educational examples in which new test items are often introduced to prevent overexposure of assessment tasks).
A Bayesian Model for Collaborative Filtering
Authors: YungHsin Chien and Edward I. George, Department of MSIS at University of Texas at Austin
Abstract: Consider the general setup where a set of items have been partially rated by a set of judges, in the sense that not every item has been rated by every judge. For this setup, we propose a Bayesian approach for the problem of predicting the missing ratings from the observed ratings. This approach incorporates similarity by assuming the set of judges can be partitioned into groups which share the same ratings probability distribution. This leads to a predictive distribution of missing ratings based on the posterior distribution of the groupings and associated ratings probabilities. Markov chain Monte Carlo methods and a hybrid search algorithm are then used to obtain predictions of the missing ratings.
Parameter learning from incomplete data for Bayesian networks
Author: RG Cowell
Abstract: In a companion paper (Cowell 1999), I described a method of using maximum entropy to estimate the joint probability distribution for a set of discrete variables from missing data. Here I extend the method of that paper to incorporate prior information for application to parameter learning in Bayesian networks.
On the Application of The Bootstrap for Computing Confidence Measures on Features of Induced Bayesian Networks
Authors: Nir Friedman, Institute of Computer Science at Hebrew University; Moises Goldszmidt, SRI International; Abraham Wyner, Department of Statistics at Wharton School
Abstract: In the context of learning Bayesian networks from data, very little work has been published on methods for assessing the quality of an induced model. This issue, however, has received a great deal of attention in the statistics literature. In this paper, we take a wellknown method from statistics, Efron’s Bootstrap, and examine its applicability for assessing a confidence measure on features of the learned network structure. We also compare this method to assessments based on a practical realization of the Bayesian methodology.
Relaxing the Local Independence Assumption for Quantitative Learning in Acyclic Directed Graphical Models through Hierarchical Partition Models
Authors: Daniela Golinelli and David Madigan, Department of Statistics at University of Washington and Guido Consonni, Dip. di Economia e Metodi Quantitativi at Universita’ di Pavia
Abstract: The simplest method proposed by Spiegelhalter and Lauritzen (1990) to perform quantitative learning in ADG presents a potential weakness: the local independence assumption. We propose to alleviate this problem through the use of Hierarchical Partition Models. Our approach is compared with the previous one from an interpretative and predictive point of view.
The exploration of new methods for learning in binary Boltzmann machines
Authors: Keith Humphreys and D.M. Titterington, Department of Statistics at University of Glasgow
Abstract: Exact inference for Boltzmann machines is computationally expensive. One approach to improving tractability is to approximate the gradient algorithm. We describe a new way of doing this which is based on Bahadur’s representation of the multivariate binary distribution (Bahadur, 1961). We compare the approach, for networks with no unobserved variable, to the “mean field” approximation of Peterson and Anderson (1987) and the approach of Kappen and Rodriguez (1998), which is based on the linear response theorem. We also investigate the use of the pairwise association cluster method (Tanaka and Morita, 1995).
Statistical Challenges to inductive inference in linked data
Author: David Jensen
Abstract: Many data sets can be represented naturally as collections of linked objects. For example, document collections can be represented as documents (nodes) connected by citations and hypertext references (links). Similarly, organizations can be represented as people (nodes) connected reporting relationships, social relationships, and communication patterns (links).
Mixture Model Clustering with the Multimix Program
Authors: Murray A. Jorgensen and Lynette A. Hunt, Department of Statistics at University of Waikato
Abstract: Hunt (1996) has implemented the finite mixture model approach to clustering in a program called Multimix. The program is designed to cluster multivariate data with categorical and continuous variables and possibly containing missing values. The model fitted simultaneously generalises the Latent Class model and the mixture of multivariate normals model. Like either of these models Multimix can be used to form clusters by the Bayes allocation rule. This is the intended use of the program, although the parameter estimates can be used to give a succinct description of the clusters.
Use of the EM algorithm, with its view of the observed data as being notionally augmented by missing information to form the ‘complete data’, gives a broad framework for estimation which is able to handle two types of missing information: unknown cluster assignment and missing data. Using the methodology of Little and Rubin (1987). in this way Multimix is able to handle missing data in a less ad hoc way than many clustering algorithms. The program runs in acceptable time with large data matrices (say hundreds of observations on tens of variables). Use of the missingdata facility increases execution time somewhat. In this presentation we describe the approach taken to the design of Multimix and how some of the statistical problems were dealt with. As examples of the use of the program we cluster a large medical dataset and a version of Fisher’s Iris data in which a third of the values are randomly made ‘missing’.
Learning Augmented Bayesian Classifiers: A Comparison of Distributionbased and Classificationbased Approaches
Authors: Eamonn Keogh and Michael J. Pazzani, Information and Computer Science at University of California, Irvine
Abstract: The nave Bayes classifier is built on the assumption of conditional independence between the attributes given the class. The algorithm has been shown to be surprisingly robust to obvious violations of this condition, but it is natural to ask if it is possible to further improve the accuracy by relaxing this assumption. We examine an approach where nave Bayes is augmented by the addition of correlation arcs between attributes. We explore two methods for finding the set of augmenting arcs, a greedy hillclimbing search, and a novel, more computationally efficient algorithm that we call SuperParent. We compare these methods to TAN; a stateof theart distributionbased approach to finding the augmenting arcs.
Exploring the robustness of Bayesian and informationtheoretic methods for predictive inference
Authors: Petri Kontkanen, Petri Myllymäki, Tomi Silander, Henry Tirri, and Kimmo Valtonen
Abstract: Given a set of sample data, we study three alternative methods for determining the predictive distribution of an unseen data vector. In particular, we are interested in the behavior of the predictive accuracy of these three predictive methods as a function of the degree of the domain assumption violations. We explore this question empirically by using artificially generated data sets, where the assumptions can be violated in various ways. Our empirical results suggest that if the model assumptions are only mildly violated, marginalization over the model parameters may not be necessary in practice. This is due to the fact that in this case the computationally much simpler predictive distribution based on a single, maximum posterior probability model shows similar performance as the computationally more demanding marginal likelihood approach. The results also give support to Rissanen’s theoretical results about the usefulness of using Jereys’ prior distribution for the model parameters.
Availability: PDF
Structure optimization of density estimation models applied to regression problems with dynamic noise
Authors: Martin Kreutz, Bernhard Sendhoff, and Werner von Seelen, Institut für Neuroinformatik at RuhrUniversität Bochum; Anja M. Reimetz and Claus Weihs, Fachbereich Statistik at Universität Dortmund
Abstract: In this paper we deal with the problem of model selection for time series forecasting with dynamical noise and missing data. We employ an evolutionary algorithm to the optimization of a mixture of densities model in order to estimate, via a loglikelihood based quality measure, the joint probability density of the data. We apply our method to the prediction of both artificial time series, generated from the MackeyGlass equation, and time series from a real world system consisting of physiological data of apnea patients.
Related work from the authors:
 Martin Kreutz, Anja M. Reimetz, Bernhard Sendhoff, Claus Weihs and Werner von Seelen. Optimisation of Density Estimation Models with Evolutionary Algorithms. In A.E. Eiben, Th. Bäck, M. Schoenauer and H.P. Schwefel, editors, Parallel Problem Solving from Nature – PPSN V, pages 9981007, Lecture Notes in Computer Science 1498, Springer, 1998.
A learning rule based method of feature extraction with application to acoustic signal classification
Author: M.J. Larkin, The Institute for Brain and Neural Systems at Brown University
Abstract: We apply the Bienenstock, Cooper, and Munro (1982) theory of visual cortical plasticity to the problem of extracting features (i.e., reduction of dimensionality) from acoustic signals; in this case, labeled samples of marine mammal sounds. We first implemented BCM learning in a single neuron model, trained the neuron on samples of acoustic data, and then observed the response when the neuron was tested on different classes of acoustic signals. Next, a multiple neuron network was constructed, with lateral inhibition among the neurons. By training neurons to be selective to inherent features in these signals, we are able to develop networks which can then be used in the design of an automated acoustic signal classifier.
Learning Extensible MultiEntity Directed Graphical Models
Author: Kathryn Blackmond Laskey, Department of Systems Engineering and Operations Research at George Mason University
Abstract: Graphical models have become a standard tool for representing complex probability models in statistics and artificial intelligence. In problems arising in artificial intelligence, it is useful to use the belief network formalism to represent uncertain relationships among variables in the domain, but it may not be possible to use a single, fixed belief network to encompass all problem instances. This is because the number of entities to be reasoned about and their relationships to each other varies from problem instance to problem instance. This paper describes a framework for representing probabilistic knowledge as fragments of belief networks and an approach to learning both structure and parameters from observations.
A latent variable model for multivariate discretization
Authors: Stefano Monti, Intelligent Systems Program at University of Pittsburgh and Gregory F. Cooper, Center for Biomedical Informatics at University of Pittsburgh
Abstract: We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only. Building upon existing classbased discretization methods, we use a latent variable as a proxy class variable, which is then utilized to drive the partition of the value range of each continuous variable. We present experimental results on simulated data aimed at assessing the merits of the proposed method.
Testing Regression Models With Fewer Regressors
Authors: Judea Pearl and Peyman Meshkat
Abstract: A BASIS for a model M is a minimal set of tests that, if satisfied, implies the satisfaction of all the assumptions behind M. This paper proposes a graphical procedure of recognizing bases of regression models. Using this precedure, it is possible to select a set of tests in which the number of regressors is small, compared with standard tests, thus resulting in improved power.
Learning Conditional Probabilities from Incomplete Databases  An Experimental Comparison
Authors: Marco Ramoni and Paola Sebastiani, The Open University
Abstract: This paper compares three methods – the EM algorithm, Gibbs sampling, and Bound and Collapse (BC) – to estimate conditional probabilities from incomplete databases in a controlled experiment. Results show a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using BC.
Other information: Further information is available from the home page of the Bayesian Knowledge Discovery project at The Open University.
Local Experts Combination through Density Decomposition
Authors: Ahmed Rida, Abderrahim Labbi, and Christian Pellegrini, Artificial Intelligence group at Centre Universitaire d’Informatique
Abstract: In this paper we describe a divideandcombine strategy for decomposition of a complex prediction problem into simpler local subproblems. We firstly show how to perform a soft decomposition via clustering of input data. Such decomposition leads to a partition of the input space into several regions which may overlap. Therefore, to each region is assigned a local predictor (or expert) which is trained only on local data. To construct a solution to the global prediction problem, we combine the local experts using two approaches: weighted averaging where the outputs of local experts are weighted by their prior densities, and nonlinear adaptive combination where the pooling parameters are obtained through minimization of a global error. To illustrate the validity of our approach, we show simulation results for two classification tasks, vowels and phonemes, using local experts which are MultiLayer Perceptrons (MLP) and Support Vector Machines (SVM). We compare the results obtained using the two local combination modes with the results obtained using a global predictor and a linear combination of global predictors.
EntropyDriven Inference and Inconsistency
Authors: Wilhelm Rödder and Longgui Xu, FernUniversität Gesamthochschule in HagenFachbereich Wirschaftswissenschaft, Lehrstuhl für Betriebswirtschaftslehre, insb. Operations Research
Abstract: Probability distributions on a set of discrete variables are a suitable means to represent knowledge about their respective mutual dependencies. When now things become evident such a distribution can be adapted to the new situation and hence submitted to a sound inference process. Knowledge acquisition and inference are here performed in the rich syntax of conditional events. Both, acquisition and inference respect a sophisticated principle, namely that of maximum entropy and of minimum relative entropy. The freedom to formulate and derive knowledge in a language of rich syntax is comfortable but involves the danger of contradictions or inconsistencies. We develop a method how to solve such inconsistencies which go back to the incompatibility of experts’ knowledge in their respective branches. The method is applied to the diagnosis in Chinese medicine. All calculations are performed in the Entropydriven expert system shell SPIRIT.
Availability: PDF
References:
[1] I. Csiszár: IDivergence Geometry of Probability Distributions and Minimisation Problems. The Annals of Probability 3, (1): 146 – 158 (1975).
[2] I. Csiszár: Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems. The Annals of Statistics 19 (4): 2032 – 2066 (1991).
[3] G. KernIsberner: Characterising the principle of minimum crossentropy within a conditionallogical framework. Artificial Intelligence 98: 169208 (1998).
[4] S. L. Lauritzen: Graphical Association Models (Draft), Technical Report IR 932001, Institute for Electronic Systems, Dept. of Mathematics and Computer Science, Aalborg University (1993).
[5] W. Rödder and G. KernIsberner: Representation and extraction of information by probabilistic logic. Information Systems 21 (8): 637 – 652 (1996).
[6] W. Rödder and C.H. Meyer: Coherent knowledge processing at maximum entropy by SPIRIT, Proceedings 12th Conference on Uncertainty in Artificial Intelligence, E. Horitz and F. Jensen (editors), Morgan Kaufmann, San Francisco, California: 470 – 476 (1996).
[7] C. C. Schnorrenberger: Lehrbuch der chinesischen Medizin für westliche Ärzte, Hippokrates, Stuttgart (1985).
[8] C. E. Shannon: A mathematical theory of communication, Bell System Tech. J. 27, 379423 (part I), 623 – 656 (part II) (1948).
[9] J. E. Shore and R. W. Johnson: Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross Entropy. IEEE Trans. Information Theory 26 (1): 26 – 37 (1980).
[10] J. Whittaker: Graphical Models in Applied Mathematical Multivariate Statistics, John Wiley & Sons (1990).
Learned Models for Continuous Planning
Authors: Matthew D. Schmill, Tim Oates, and Paul R. Cohen, Experimental Knowledge Systems Laboratory at University of Massachusetts
Abstract: We are interested in the nature of activity — structured behavior of nontrivial duration — in intelligent agents. We believe that the development of activity is a continual process in which simpler activities are composed, via planning, to form more sophisticated ones in a hierarchical fashion. The success or failure of a planner depends on its models of the environment, and its ability to implement its plans in the world. We describe an approach to generating dynamical models of activity from realworld experiences and explain how they can be applied towards planning in a continuous state space.
Efficient Optimization of Large k Realtime Control Algorithm
Authors: Delphi Delco Electronics Systems, Restraint Systems Electronics and Daniel H. Loughlin, North Carolina State University
Abstract: Resource requirements for global optimization increase dramatically with the number of realvalued decision variables (k). Efficient search strategies are needed to satisfy constraints of time, effort, and funding. In this paper, a conjunction of several disparate methods is used to automatically calibrate a nonlinear realtime control used in the automotive industry. By combining a response surface methodology with a hybrid genetic algorithm search, airbag deployment calibrations can be automated, producing solutions superior to conventional manual search.
[panel header="Model Folding for Data Subject to Nonresponse"]
Authors: Paola Sebastiani, The Open University and Marco Ramoni, Knowledge Media Institute at The Open University
Abstract: In this paper we introduce a deterministic method to estimate the posterior probability of rival models from data with partially ignorable nonresponse. The accuracy of the method will be shown via an application to synthetic data.
Other information: Futher information is available from the home page of the Bayesian Knowledge Discovery project at The Open University.
[panel header="Geometry, Moments and Bayesian Networks with Hidden Variables"]
Authors: Raffaella Settimi, University of Warwick and Jim Q. Smith, University of Warwick
Abstract: The purpose of this paper is to present a systematic way of analysing the geometry of the probability spaces for a particular class of Bayesian networks with hidden variables. It will be shown that the conditional independence statements implicit in such graphical models can be neatly expressed as simple polynomial relationships among central moments. This algebraic framework will enable us to explore and identify the structural constraints on the sample space induced by models with tree structures and therefore characterise the families of distributions consistent with such conditional independence assumptions.
Joint probabilistic clustering of multivariate and sequential data
Author: Padhraic Smyth, University of California–Irvine
Consider the following problem. We have a set of individuals (a random sample from a larger population) whom we would like to cluster into groups based on observational data. For each individual we can measure characteristics which are relatively static (eg, their height, weight, income, age, sex, etc). Probabilistic modelbased clustering in this contextusually takes the form of a nite mixture model, where each component in the mixture is a multivariate probability density function (or distribution function) for a particular group. This approach has been found to be a useful general technique for extracting hidden structure from multivariate data (Ban eld and Raftery, 1993; Thiesson et al, 1997).
Analysis of multivariate time series via a hidden graphical model
Authors: Elena Stanghellini, Universita’ di Perugia and Joe Whittaker, Lancaster University
Abstract: We propose a chain graph with unobserved variables to model a multivariate time series. We assume that an underlying common trend linearly affects the observed time series, but we do not restrict our analysis to models where the underlying factor accounts for all the contemporary correlations of the series. The residual correlation is modelled using results of graphical models. Modelling the associations left unexplained is an alternative to augmenting the dimension of the underlying factor. It is justified when a clear interpretation of the residual associations is available. It is also an informative way to explore sources of deviation from standard dynamic singlefactor models.
Visual design support for probabilistic network application
Author: Axel Vogler, Daimler Benz AG
Abstract: Understanding inference in probabilistic networks is an important point in the design phase. Their causal structure and locally defined parameters are intuitive to human experts. The global system induced by the local parameters can lead to results not intended by the human experts. To support network design an edge coloring scheme explaining influences between variables responsible for inference result is introduced.