January 3–6, 1999

Uncertainty 99

Location: Fort Lauderdale, FL, USA

Poster Sessions | January 4

  • Authors: Russell Almond, Research Statistics Group at Educational Testing Service; Edward Herskovits, Noetic Systems, Inc.; Robert J. Mislevy, Model Based Measurement Group at Educational Testing Service; Linda Steinberg, Educational Policy Research at Educational Testing Service

    Abstract: In this paper we illustrate a simple scheme for dividing a complex Bayes network into a system model and a collection of smaller evidence models. While the system model maintains a permanent record of the state of the system of interest, the evidence models are only used momentarily to absorb evidence from specific observations or findings and are then discarded. This paper describes an implementation of a system model–evidence model complex in which each system and evidence model has a separate Bayes net and Markov tree representation. As necessary, information is propagated between common Markov tree nodes of the evidence and system models. While mathematically equivalent to the full Bayes network, the system model–evidence model complex allows us to (a) separate the seldom used evidence model portions from the core system model, thus reducing search and propagation time in the network, and (b) easily replace the evidence models (this is particularly advantageous in educational examples in which new test items are often introduced to prevent overexposure of assessment tasks).
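
    A minimal numeric sketch of the evidence-absorption step, assuming a single skill variable S shared between the system model and one evidence model; the variable names and probabilities below are illustrative, not from the paper.

    ```python
    import numpy as np

    # Minimal sketch: a "system model" keeps a running belief over a skill
    # variable S; each "evidence model" is a conditional table P(X | S) that
    # is used once to absorb an observed item response and then discarded.
    # All names and numbers are illustrative.

    belief_S = np.array([0.5, 0.5])          # P(S): S in {low, high}

    def absorb(belief, evidence_model, observation):
        """Multiply the likelihood P(X=obs | S) into the belief over S."""
        likelihood = evidence_model[:, observation]   # column for the observed X
        posterior = belief * likelihood
        return posterior / posterior.sum()

    # Evidence model for one test item: rows index S, columns index X in {wrong, right}.
    item_1 = np.array([[0.8, 0.2],    # P(X | S=low)
                       [0.3, 0.7]])   # P(X | S=high)

    belief_S = absorb(belief_S, item_1, observation=1)   # student got item 1 right
    print(belief_S)   # belief shifts toward S=high; item_1 can now be discarded
    ```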

  • Authors: Yung-Hsin Chien and Edward I. George, Department of MSIS at University of Texas at Austin

    Abstract: Consider the general setup where a set of items have been partially rated by a set of judges, in the sense that not every item has been rated by every judge. For this setup, we propose a Bayesian approach for the problem of predicting the missing ratings from the observed ratings. This approach incorporates similarity by assuming the set of judges can be partitioned into groups which share the same ratings probability distribution. This leads to a predictive distribution of missing ratings based on the posterior distribution of the groupings and associated ratings probabilities. Markov chain Monte Carlo methods and a hybrid search algorithm are then used to obtain predictions of the missing ratings.
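
    A hedged sketch of only the prediction step: given one candidate grouping of the judges, a missing rating is predicted from the Dirichlet-smoothed ratings distribution of the judge's group for that item. The search over groupings (the MCMC and hybrid search of the paper) is omitted, and all data are illustrative.

    ```python
    import numpy as np

    R = 5                                   # ratings take values 0..R-1
    ratings = {                             # (judge, item) -> observed rating
        (0, 0): 4, (1, 0): 4, (2, 0): 1,
        (0, 1): 3, (2, 1): 0,
    }
    group_of = {0: "A", 1: "A", 2: "B"}     # one candidate grouping of judges
    alpha = 1.0                             # Dirichlet smoothing pseudo-count

    def predictive(judge, item):
        """P(rating | judge's group, item) under Dirichlet-multinomial smoothing."""
        counts = np.full(R, alpha)
        for (j, i), r in ratings.items():
            if i == item and group_of[j] == group_of[judge]:
                counts[r] += 1
        return counts / counts.sum()

    print(predictive(judge=1, item=1))      # judge 1 never rated item 1;
                                            # the prediction borrows from group A
    ```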

  • Author: R. G. Cowell

    Abstract: In a companion paper (Cowell 1999), I described a method of using maximum entropy to estimate the joint probability distribution for a set of discrete variables from incomplete data. Here I extend the method of that paper to incorporate prior information for application to parameter learning in Bayesian networks.

  • Authors: Nir Friedman, Institute of Computer Science at Hebrew University; Moises Goldszmidt, SRI International; Abraham Wyner, Department of Statistics at Wharton School

    Abstract: In the context of learning Bayesian networks from data, very little work has been published on methods for assessing the quality of an induced model. This issue, however, has received a great deal of attention in the statistics literature. In this paper, we take a well-known method from statistics, Efron’s Bootstrap, and examine its applicability for assessing a confidence measure on features of the learned network structure. We also compare this method to assessments based on a practical realization of the Bayesian methodology.
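
    A minimal sketch of the bootstrap idea for edge confidence, assuming a crude stand-in structure learner (a correlation threshold) rather than the paper's learner: resample the cases with replacement, relearn the structure, and report how often each feature (here, an edge) recurs.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 4
    X = rng.normal(size=(n, d))
    X[:, 1] += X[:, 0]                     # induce a real dependency 0 -> 1

    def learn_edges(data, threshold=0.3):
        """Stand-in structure learner: undirected edges from strong correlations."""
        k = data.shape[1]
        corr = np.corrcoef(data, rowvar=False)
        return {(i, j) for i in range(k) for j in range(i + 1, k)
                if abs(corr[i, j]) > threshold}

    counts = {}
    B = 100
    for _ in range(B):
        boot = X[rng.integers(0, n, size=n)]       # resample cases with replacement
        for edge in learn_edges(boot):
            counts[edge] = counts.get(edge, 0) + 1

    for edge, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(edge, c / B)                         # bootstrap edge confidence
    ```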

  • Authors: Daniela Golinelli and David Madigan, Department of Statistics at University of Washington, and Guido Consonni, Dip. di Economia e Metodi Quantitativi at Università di Pavia

    Abstract: The simplest method proposed by Spiegelhalter and Lauritzen (1990) to perform quantitative learning in ADGs (acyclic directed graphs) presents a potential weakness: the local independence assumption. We propose to alleviate this problem through the use of Hierarchical Partition Models. Our approach is compared with the previous one from an interpretative and predictive point of view.

  • Authors: Keith Humphreys and D.M. Titterington, Department of Statistics at University of Glasgow

    Abstract: Exact inference for Boltzmann machines is computationally expensive. One approach to improving tractability is to approximate the gradient algorithm. We describe a new way of doing this which is based on Bahadur’s representation of the multivariate binary distribution (Bahadur, 1961). We compare the approach, for networks with no unobserved variables, to the “mean field” approximation of Peterson and Anderson (1987) and the approach of Kappen and Rodriguez (1998), which is based on the linear response theorem. We also investigate the use of the pairwise association cluster method (Tanaka and Morita, 1995).
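
    As a point of reference, the Peterson and Anderson (1987) mean-field approximation mentioned above can be sketched as a damped fixed-point iteration m_i = tanh(b_i + Σ_j W_ij m_j) for ±1 units; the couplings below are arbitrary illustrative numbers.

    ```python
    import numpy as np

    # Mean-field approximation for a small Boltzmann machine with +/-1 units:
    # the equilibrium means m_i are approximated without any sampling.
    W = np.array([[0.0, 0.5, -0.3],
                  [0.5, 0.0, 0.2],
                  [-0.3, 0.2, 0.0]])       # symmetric couplings, zero diagonal
    b = np.array([0.1, -0.2, 0.0])         # biases

    m = np.zeros(3)                        # mean activations, initialised at 0
    for _ in range(200):                   # damped fixed-point iteration
        m = 0.5 * m + 0.5 * np.tanh(b + W @ m)
    print(m)                               # approximate E[s_i] for each unit
    ```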

  • Author: David Jensen

    Abstract: Many data sets can be represented naturally as collections of linked objects. For example, document collections can be represented as documents (nodes) connected by citations and hypertext references (links). Similarly, organizations can be represented as people (nodes) connected by reporting relationships, social relationships, and communication patterns (links).

  • Authors: Murray A. Jorgensen and Lynette A. Hunt, Department of Statistics at University of Waikato

    Abstract: Hunt (1996) has implemented the finite mixture model approach to clustering in a program called Multimix. The program is designed to cluster multivariate data with categorical and continuous variables, possibly containing missing values. The model fitted simultaneously generalises the Latent Class model and the mixture of multivariate normals model. Like either of these models, Multimix can be used to form clusters by the Bayes allocation rule. This is the intended use of the program, although the parameter estimates can be used to give a succinct description of the clusters.

    Use of the EM algorithm, with its view of the observed data as being notionally augmented by missing information to form the ‘complete data’, gives a broad framework for estimation which is able to handle two types of missing information: unknown cluster assignment and missing data. Using the methodology of Little and Rubin (1987) in this way, Multimix is able to handle missing data in a less ad hoc way than many clustering algorithms. The program runs in acceptable time with large data matrices (say hundreds of observations on tens of variables). Use of the missing-data facility increases execution time somewhat. In this presentation we describe the approach taken to the design of Multimix and how some of the statistical problems were dealt with. As examples of the use of the program we cluster a large medical dataset and a version of Fisher’s Iris data in which a third of the values are randomly made ‘missing’.
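
    A toy reconstruction of this idea for the all-continuous case: EM for a mixture of diagonal-covariance Gaussians in which both cluster labels and NaN entries are treated as missing information. Multimix itself also handles categorical variables; this sketch does not, and the data are synthetic.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
    X[rng.random(X.shape) < 0.2] = np.nan       # knock out ~20% of the values
    obs = ~np.isnan(X)

    K, (n, d) = 2, X.shape
    weights = np.full(K, 1.0 / K)
    mu = rng.normal(2, 1, (K, d))
    var = np.ones((K, d))

    for _ in range(50):
        # E-step: responsibilities from the observed coordinates of each case only.
        logr = np.log(weights) + np.zeros((n, K))
        for k in range(K):
            ll = -0.5 * (np.log(2 * np.pi * var[k]) + (X - mu[k]) ** 2 / var[k])
            logr[:, k] += np.where(obs, ll, 0.0).sum(axis=1)
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: missing coordinates are filled by their conditional expectations
        # under the current parameters (mean mu_k, with var_k added back in the
        # second-moment update).
        for k in range(K):
            w = r[:, k][:, None]
            Xhat = np.where(obs, X, mu[k])
            mu_new = (w * Xhat).sum(axis=0) / w.sum()
            sq = np.where(obs, (Xhat - mu_new) ** 2, (mu[k] - mu_new) ** 2 + var[k])
            var[k] = np.maximum((w * sq).sum(axis=0) / w.sum(), 1e-6)
            mu[k] = mu_new
        weights = r.mean(axis=0)

    print(mu)        # should recover component means near (0, 0) and (4, 4)
    ```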

  • Authors: Eamonn Keogh and Michael J. Pazzani, Information and Computer Science at University of California, Irvine

    Abstract: The naive Bayes classifier is built on the assumption of conditional independence between the attributes given the class. The algorithm has been shown to be surprisingly robust to obvious violations of this condition, but it is natural to ask whether it is possible to further improve the accuracy by relaxing this assumption. We examine an approach where naive Bayes is augmented by the addition of correlation arcs between attributes. We explore two methods for finding the set of augmenting arcs: a greedy hill-climbing search, and a novel, more computationally efficient algorithm that we call SuperParent. We compare these methods to TAN, a state-of-the-art distribution-based approach to finding the augmenting arcs.
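
    A toy sketch of what an augmenting arc adds, assuming a tiny binary dataset: under naive Bayes each attribute depends on the class alone, while an augmenting arc lets one attribute depend on both the class and a parent attribute. The search for good arcs (hill-climbing or SuperParent) is not shown.

    ```python
    import numpy as np

    # Two binary attributes and a binary class; counts are Laplace-smoothed.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 0]])
    y = np.array([0, 0, 0, 1, 1, 1])

    def p_attr_given_class(a, v, c):                 # naive Bayes factor P(X_a=v | C=c)
        rows = X[y == c]
        return (np.sum(rows[:, a] == v) + 1) / (len(rows) + 2)

    def p_attr_given_class_and_parent(a, v, c, p, pv):  # augmented P(X_a=v | C=c, X_p=pv)
        rows = X[(y == c) & (X[:, p] == pv)]
        return (np.sum(rows[:, a] == v) + 1) / (len(rows) + 2)

    def posterior(x, augment=False):
        post = []
        for c in (0, 1):
            pr = np.mean(y == c) * p_attr_given_class(0, x[0], c)
            if augment:    # attribute 1 gets the extra parent: attribute 0
                pr *= p_attr_given_class_and_parent(1, x[1], c, 0, x[0])
            else:
                pr *= p_attr_given_class(1, x[1], c)
            post.append(pr)
        return np.array(post) / sum(post)

    print(posterior([1, 1], augment=False), posterior([1, 1], augment=True))
    ```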

  • Authors: Petri Kontkanen, Petri Myllymäki, Tomi Silander, Henry Tirri, and Kimmo Valtonen

    Abstract: Given a set of sample data, we study three alternative methods for determining the predictive distribution of an unseen data vector. In particular, we are interested in the behavior of the predictive accuracy of these three predictive methods as a function of the degree of the domain assumption violations. We explore this question empirically by using artificially generated data sets, where the assumptions can be violated in various ways. Our empirical results suggest that if the model assumptions are only mildly violated, marginalization over the model parameters may not be necessary in practice. This is due to the fact that in this case the computationally much simpler predictive distribution based on a single, maximum posterior probability model shows performance similar to that of the computationally more demanding marginal likelihood approach. The results also give support to Rissanen’s theoretical results about the usefulness of using Jeffreys’ prior distribution for the model parameters.

    Availability: PDF
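
    A toy version of the MAP-versus-marginalization comparison for a single Bernoulli parameter under the Jeffreys Beta(1/2, 1/2) prior; the counts are made up, and the closed forms below are standard conjugate results rather than the paper's experiments.

    ```python
    # "Single MAP model" predictive plugs in the posterior mode; marginalising
    # over the parameter gives the closed-form posterior-predictive mean.
    heads, tails = 3, 1
    a, b = 0.5, 0.5                         # Jeffreys prior for the Bernoulli model

    map_theta = (heads + a - 1) / (heads + tails + a + b - 2)   # posterior mode
    marginal = (heads + a) / (heads + tails + a + b)            # E[theta | data]

    print(f"MAP plug-in predictive:  P(next=1) = {map_theta:.3f}")
    print(f"Marginalised predictive: P(next=1) = {marginal:.3f}")
    # With even modest data the two predictives are close, matching the
    # abstract's observation about mild assumption violations.
    ```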

  • Authors: Martin Kreutz, Bernhard Sendhoff, and Werner von Seelen, Institut für Neuroinformatik at Ruhr-Universität Bochum; Anja M. Reimetz and Claus Weihs, Fachbereich Statistik at Universität Dortmund

    Abstract: In this paper we deal with the problem of model selection for time series forecasting with dynamical noise and missing data. We employ an evolutionary algorithm for the optimization of a mixture-of-densities model in order to estimate, via a log-likelihood based quality measure, the joint probability density of the data. We apply our method to the prediction of both artificial time series, generated from the Mackey-Glass equation, and time series from a real-world system consisting of physiological data of apnea patients. (A toy sketch of the evolutionary optimization step appears after the related work below.)

    Related work from the authors:

    • Martin Kreutz, Anja M. Reimetz, Bernhard Sendhoff, Claus Weihs and Werner von Seelen. Optimisation of Density Estimation Models with Evolutionary Algorithms. In A.E. Eiben, Th. Bäck, M. Schoenauer and H.P. Schwefel, editors, Parallel Problem Solving from Nature – PPSN V, pages 998-1007, Lecture Notes in Computer Science 1498, Springer, 1998.
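
    The toy sketch promised above, assuming a simple (mu + lambda) evolution strategy and a fixed two-component 1-D Gaussian mixture; the paper's method additionally selects model structure and handles missing data.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    data = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 1.0, 150)])

    def loglik(theta):
        """Log-likelihood quality measure for parameters (w, m1, s1, m2, s2)."""
        w, m1, s1, m2, s2 = theta
        if not (0.0 < w < 1.0) or s1 <= 0.05 or s2 <= 0.05:
            return -np.inf                      # reject infeasible parameters
        pdf = (w * np.exp(-0.5 * ((data - m1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
               + (1 - w) * np.exp(-0.5 * ((data - m2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi)))
        with np.errstate(divide="ignore"):
            return np.log(pdf).sum()

    mu_size, lam, sigma = 5, 20, 0.3            # parents, offspring, mutation scale
    pop = [rng.normal([0.5, -1.0, 1.0, 1.0, 1.0], 0.5) for _ in range(mu_size)]
    for _ in range(100):
        children = [p + rng.normal(0, sigma, 5) for p in pop
                    for _ in range(lam // mu_size)]
        pop = sorted(pop + children, key=loglik, reverse=True)[:mu_size]
    print(pop[0], loglik(pop[0]))               # best mixture parameters found
    ```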

  • Author: M.J. Larkin, The Institute for Brain and Neural Systems at Brown University

    Abstract: We apply the Bienenstock, Cooper, and Munro (1982) theory of visual cortical plasticity to the problem of extracting features (i.e., reduction of dimensionality) from acoustic signals; in this case, labeled samples of marine mammal sounds. We first implemented BCM learning in a single neuron model, trained the neuron on samples of acoustic data, and then observed the response when the neuron was tested on different classes of acoustic signals. Next, a multiple neuron network was constructed, with lateral inhibition among the neurons. By training neurons to be selective to inherent features in these signals, we are able to develop networks which can then be used in the design of an automated acoustic signal classifier.
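
    A minimal sketch of the BCM rule for a single linear neuron, assuming the common form dw = η·x·y·(y − θ) with θ tracking a running average of y²; the random 2-D inputs stand in for the acoustic features used in the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)
    # Correlated 2-D input patterns with one dominant direction.
    patterns = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])

    w = rng.normal(scale=0.1, size=2)
    theta, eta, tau = 1.0, 0.001, 0.01
    for x in patterns:
        y = w @ x                               # linear neuron output
        w += eta * x * y * (y - theta)          # BCM weight update
        theta += tau * (y ** 2 - theta)         # sliding modification threshold
    print(w)        # the neuron becomes selective to a dominant input direction
    ```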

  • Author: Kathryn Blackmond Laskey, Department of Systems Engineering and Operations Research at George Mason University

    Abstract: Graphical models have become a standard tool for representing complex probability models in statistics and artificial intelligence. In problems arising in artificial intelligence, it is useful to use the belief network formalism to represent uncertain relationships among variables in the domain, but it may not be possible to use a single, fixed belief network to encompass all problem instances. This is because the number of entities to be reasoned about and their relationships to each other varies from problem instance to problem instance. This paper describes a framework for representing probabilistic knowledge as fragments of belief networks and an approach to learning both structure and parameters from observations.

  • Authors: Stefano Monti, Intelligent Systems Program at University of Pittsburgh and Gregory F. Cooper, Center for Biomedical Informatics at University of Pittsburgh

    Abstract: We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only. Building upon existing class-based discretization methods, we use a latent variable as a proxy class variable, which is then utilized to drive the partition of the value range of each continuous variable. We present experimental results on simulated data aimed at assessing the merits of the proposed method.
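
    A hedged simplification of the idea: fit a mixture model as the latent proxy class, then cut each continuous variable between the component means so the partition reflects the latent clustering. The paper's actual partitioning procedure may differ; sklearn and the synthetic data are assumptions of this sketch.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),
                   rng.normal([3, 4], 0.7, (100, 2))])

    # The latent variable model acting as a proxy class variable.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    for v in range(X.shape[1]):
        means = np.sort(gmm.means_[:, v])
        cuts = (means[:-1] + means[1:]) / 2    # midpoints between component means
        print(f"variable {v}: cut points at {cuts}")
        # np.digitize(X[:, v], cuts) would give the discretized variable
    ```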

  • Authors: Judea Pearl and Peyman Meshkat

    Abstract: A BASIS for a model M is a minimal set of tests that, if satisfied, implies the satisfaction of all the assumptions behind M. This paper proposes a graphical procedure for recognizing bases of regression models. Using this procedure, it is possible to select a set of tests in which the number of regressors is small, compared with standard tests, thus resulting in improved power.

  • Authors: Marco Ramoni and Paola Sebastiani, The Open University

    Abstract: This paper compares three methods – the EM algorithm, Gibbs sampling, and Bound and Collapse (BC) – to estimate conditional probabilities from incomplete databases in a controlled experiment. Results show a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using BC.

    Other information: Further information is available from the home page of the Bayesian Knowledge Discovery project at The Open University.

  • Authors: Ahmed Rida, Abderrahim Labbi, and Christian Pellegrini, Artificial Intelligence group at Centre Universitaire d’Informatique

    Abstract: In this paper we describe a divide-and-combine strategy for decomposition of a complex prediction problem into simpler local sub-problems. We first show how to perform a soft decomposition via clustering of input data. Such decomposition leads to a partition of the input space into several regions which may overlap. Each region is then assigned a local predictor (or expert), which is trained only on local data. To construct a solution to the global prediction problem, we combine the local experts using two approaches: weighted averaging, where the outputs of local experts are weighted by their prior densities, and nonlinear adaptive combination, where the pooling parameters are obtained through minimization of a global error. To illustrate the validity of our approach, we show simulation results for two classification tasks, vowels and phonemes, using local experts which are Multi-Layer Perceptrons (MLP) and Support Vector Machines (SVM). We compare the results obtained using the two local combination modes with the results obtained using a global predictor and a linear combination of global predictors.
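
    A small sketch of the weighted-averaging combination, assuming sklearn's KMeans regions and logistic-regression experts in place of the paper's MLP/SVM experts on speech data; the soft weights come from distances to the region centres.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 2))
    y = ((X[:, 0] > 0) ^ (rng.random(300) < 0.1)).astype(int)   # noisy labels

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    experts = [LogisticRegression().fit(X[km.labels_ == k], y[km.labels_ == k])
               for k in range(3)]

    def predict_proba(x):
        """Soft-weighted average of local experts (weights from cluster distances)."""
        d = np.linalg.norm(km.cluster_centers_ - x, axis=1)
        w = np.exp(-d ** 2)
        w /= w.sum()
        return sum(wk * e.predict_proba(x.reshape(1, -1))[0, 1]
                   for wk, e in zip(w, experts))

    print(predict_proba(np.array([0.5, 1.5])))
    ```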

  • Authors: Wilhelm Rödder and Longgui Xu, FernUniversität Gesamthochschule in Hagen, Fachbereich Wirtschaftswissenschaft, Lehrstuhl für Betriebswirtschaftslehre, insb. Operations Research

    Abstract: Probability distributions on a set of discrete variables are a suitable means of representing knowledge about their mutual dependencies. When new facts become evident, such a distribution can be adapted to the new situation and thereby submitted to a sound inference process. Knowledge acquisition and inference are here performed in the rich syntax of conditional events. Both acquisition and inference respect a sophisticated principle, namely that of maximum entropy and of minimum relative entropy. The freedom to formulate and derive knowledge in a language of rich syntax is comfortable but involves the danger of contradictions or inconsistencies. We develop a method for resolving such inconsistencies, which go back to the incompatibility of experts’ knowledge in their respective branches. The method is applied to diagnosis in Chinese medicine. All calculations are performed in the entropy-driven expert system shell SPIRIT. (A toy minimum relative entropy sketch follows the references below.)

    Availability: PDF

    References: 

    [1] I. Csiszár: I-Divergence Geometry of Probability Distributions and Minimization Problems. The Annals of Probability 3 (1): 146–158 (1975).
    [2] I. Csiszár: Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems. The Annals of Statistics 19 (4): 2032–2066 (1991).
    [3] G. Kern-Isberner: Characterising the principle of minimum cross-entropy within a conditional-logical framework. Artificial Intelligence 98: 169–208 (1998).
    [4] S. L. Lauritzen: Graphical Association Models (Draft). Technical Report IR 93-2001, Institute for Electronic Systems, Dept. of Mathematics and Computer Science, Aalborg University (1993).
    [5] W. Rödder and G. Kern-Isberner: Representation and extraction of information by probabilistic logic. Information Systems 21 (8): 637–652 (1996).
    [6] W. Rödder and C.-H. Meyer: Coherent knowledge processing at maximum entropy by SPIRIT. Proceedings 12th Conference on Uncertainty in Artificial Intelligence, E. Horvitz and F. Jensen (editors), Morgan Kaufmann, San Francisco, California: 470–476 (1996).
    [7] C. C. Schnorrenberger: Lehrbuch der chinesischen Medizin für westliche Ärzte. Hippokrates, Stuttgart (1985).
    [8] C. E. Shannon: A mathematical theory of communication. Bell System Tech. J. 27: 379–423 (part I), 623–656 (part II) (1948).
    [9] J. E. Shore and R. W. Johnson: Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy. IEEE Trans. Information Theory 26 (1): 26–37 (1980).
    [10] J. Whittaker: Graphical Models in Applied Multivariate Statistics. John Wiley & Sons (1990).
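
    The toy minimum relative entropy sketch promised above: adapting a joint distribution to a new marginal fact is an I-projection (cf. Csiszár [1]), which for a simple marginal constraint reduces to rescaling each slice. SPIRIT itself handles much richer conditional constraints; this shows only the simplest case.

    ```python
    import numpy as np

    P = np.array([[0.30, 0.20],      # P(A=0, B=0), P(A=0, B=1)
                  [0.10, 0.40]])     # P(A=1, B=0), P(A=1, B=1)

    target_pA1 = 0.7                 # new evidence: P(A=1) must equal 0.7
    pA = P.sum(axis=1)               # current marginal of A: [0.5, 0.5]

    Q = P.copy()
    Q[0] *= (1 - target_pA1) / pA[0] # rescale the A=0 slice
    Q[1] *= target_pA1 / pA[1]       # rescale the A=1 slice

    print(Q, Q.sum())                # Q satisfies the constraint while staying
                                     # as close to P as possible in relative entropy
    ```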

  • Authors: Matthew D. Schmill, Tim Oates, and Paul R. Cohen, Experimental Knowledge Systems Laboratory at University of Massachusetts

    Abstract: We are interested in the nature of activity — structured behavior of nontrivial duration — in intelligent agents. We believe that the development of activity is a continual process in which simpler activities are composed, via planning, to form more sophisticated ones in a hierarchical fashion. The success or failure of a planner depends on its models of the environment, and its ability to implement its plans in the world. We describe an approach to generating dynamical models of activity from real-world experiences and explain how they can be applied towards planning in a continuous state space.

  • Authors: Delphi Delco Electronics Systems, Restraint Systems Electronics and Daniel H. Loughlin, North Carolina State University

    Abstract: Resource requirements for global optimization increase dramatically with the number of real-valued decision variables (k). Efficient search strategies are needed to satisfy constraints of time, effort, and funding. In this paper, a conjunction of several disparate methods is used to automatically calibrate a non-linear real-time control used in the automotive industry. By combining a response surface methodology with a hybrid genetic algorithm search, air-bag deployment calibrations can be automated, producing solutions superior to conventional manual search.
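
    A hedged sketch of the two-stage strategy, assuming a quadratic response surface and a mutation-plus-truncation-selection genetic search; the cheap analytic objective below stands in for the expensive air-bag calibration runs.

    ```python
    import numpy as np

    rng = np.random.default_rng(6)

    def expensive(x):                       # stand-in for a costly simulation
        return -((x[0] - 1) ** 2 + (x[1] + 0.5) ** 2) + 0.05 * rng.normal()

    # Stage 1: sample the expensive function and fit a quadratic surrogate.
    S = rng.uniform(-3, 3, size=(40, 2))
    f = np.array([expensive(x) for x in S])
    A = np.column_stack([np.ones(40), S, S ** 2, S[:, 0] * S[:, 1]])
    coef, *_ = np.linalg.lstsq(A, f, rcond=None)

    def surrogate(x):
        return coef @ np.array([1, x[0], x[1], x[0] ** 2, x[1] ** 2, x[0] * x[1]])

    # Stage 2: simple GA (mutation + truncation selection) on the cheap surrogate.
    pop = rng.uniform(-3, 3, size=(30, 2))
    for _ in range(60):
        children = pop + rng.normal(0, 0.2, pop.shape)
        both = np.vstack([pop, children])
        pop = both[np.argsort([-surrogate(x) for x in both])][:30]
    print(pop[0])                           # should approach the optimum (1, -0.5)
    ```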

  • Authors: Paola Sebastiani, The Open University and Marco Ramoni, Knowledge Media Institute at The Open University

    Abstract: In this paper we introduce a deterministic method to estimate the posterior probability of rival models from data with partially ignorable nonresponse.  The accuracy of the method will be shown via an application to synthetic data.

    Other information: Further information is available from the home page of the Bayesian Knowledge Discovery project at The Open University.

  • Authors: Raffaella Settimi and Jim Q. Smith, University of Warwick

    Abstract: The purpose of this paper is to present a systematic way of analysing the geometry of the probability spaces for a particular class of Bayesian networks with hidden variables. It will be shown that the conditional independence statements implicit in such graphical models can be neatly expressed as simple polynomial relationships among central moments. This algebraic framework will enable us to explore and identify the structural constraints on the sample space induced by models with tree structures and therefore characterise the families of distributions consistent with such conditional independence assumptions.
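
    One familiar instance of such a polynomial moment constraint (offered as an illustration, not the paper's general derivation) is the tetrad condition for four observed variables that are mutually independent given a single binary hidden variable H:

    ```latex
    \[
    \operatorname{Cov}(X_1,X_2)\,\operatorname{Cov}(X_3,X_4)
    = \operatorname{Cov}(X_1,X_3)\,\operatorname{Cov}(X_2,X_4)
    = \operatorname{Cov}(X_1,X_4)\,\operatorname{Cov}(X_2,X_3).
    \]
    ```

    The conditional independence structure thus surfaces as polynomial relations among second-order central moments, the kind of constraint the paper characterises for tree models.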

  • Author: Padhraic Smyth, University of California–Irvine

    Abstract: Consider the following problem. We have a set of individuals (a random sample from a larger population) whom we would like to cluster into groups based on observational data. For each individual we can measure characteristics which are relatively static (e.g., their height, weight, income, age, sex, etc.). Probabilistic model-based clustering in this context usually takes the form of a finite mixture model, where each component in the mixture is a multivariate probability density function (or distribution function) for a particular group. This approach has been found to be a useful general technique for extracting hidden structure from multivariate data (Banfield and Raftery, 1993; Thiesson et al., 1997).
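
    A minimal usage example of such model-based clustering, assuming sklearn's GaussianMixture and synthetic height/weight measurements in place of real observational data:

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal([170, 65], [8, 7], (100, 2)),     # e.g. height, weight
                   rng.normal([185, 90], [7, 9], (100, 2))])

    mix = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(mix.means_)                # per-group mean height/weight
    print(mix.predict_proba(X[:3]))  # soft group memberships for individuals
    ```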

  • Authors: Elena Stanghellini, Università di Perugia and Joe Whittaker, Lancaster University

    Abstract: We propose a chain graph with unobserved variables to model a multivariate time series. We assume that an underlying common trend linearly affects the observed time series, but we do not restrict our analysis to models where the underlying factor accounts for all the contemporary correlations of the series. The residual correlation is modelled using results of graphical models. Modelling the associations left unexplained is an alternative to augmenting the dimension of the underlying factor. It is justified when a clear interpretation of the residual associations is available. It is also an informative way to explore sources of deviation from standard dynamic single-factor models.

  • Author: Axel Vogler, Daimler-Benz AG

    Abstract: Understanding inference in probabilistic networks is an important concern in the design phase. Their causal structure and locally defined parameters are intuitive to human experts, but the global behavior induced by the local parameters can lead to results the experts did not intend. To support network design, an edge coloring scheme is introduced that explains the influences between variables responsible for an inference result.
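
    One plausible reading of the scheme, assuming binary variables (the paper's own colouring rules are not given in the abstract): colour an arc by the qualitative sign of its influence, read off the child's conditional probability table.

    ```python
    import numpy as np

    # CPT for Y with parents (X, Z): cpt[x, z] = P(Y=1 | X=x, Z=z). The numbers
    # are illustrative, not from the paper.
    cpt = np.array([[0.2, 0.5],
                    [0.6, 0.9]])

    def arc_colour(cpt_table, axis):
        """Colour for the arc from the parent indexed by `axis` into Y."""
        hi = np.take(cpt_table, 1, axis=axis)   # P(Y=1 | parent=1, other parents)
        lo = np.take(cpt_table, 0, axis=axis)   # P(Y=1 | parent=0, other parents)
        if np.all(hi >= lo):
            return "green (positive influence)"
        if np.all(hi <= lo):
            return "red (negative influence)"
        return "grey (sign depends on context)"

    print("X -> Y:", arc_colour(cpt, axis=0))
    print("Z -> Y:", arc_colour(cpt, axis=1))
    ```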