Tutorials | January 3
-
Speaker: Susan Dumais, Microsoft Research | 8:30 AM – 10:30 AM
The Web has made literally terabytes of information available at the click of a mouse. The challenge is in finding the right information. Information retrieval is concerned with providing access to textual data for which we have no good formal model, such as a relational model. Statistical approaches have been widely applied to this problem. This tutorial will provide an overview of: a) statistical characteristics of large text collections (e.g., size, sparsity, word distributions), b) important retrieval models (e.g., Boolean, vector space, and probabilistic), and c) enhancements that use unsupervised learning to model structure in text collections, or supervised learning to incorporate user feedback. We will conclude with a discussion of open research issues where better statistical models could improve retrieval performance.
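To make the retrieval models concrete, here is a minimal sketch of the vector space model in Python: documents and a query are represented as TF-IDF vectors and ranked by cosine similarity. The toy corpus, tokenization, and weighting choices are illustrative assumptions, not material from the tutorial.

```python
# Minimal vector-space retrieval sketch: rank documents against a query
# by cosine similarity of TF-IDF vectors.  Corpus and query are toy data.
import math
from collections import Counter

docs = ["statistical models of text retrieval",
        "boolean retrieval of web documents",
        "learning models from user feedback"]
query = "statistical retrieval models"

def tf_idf(corpus):
    tokenized = [doc.split() for doc in corpus]
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(len(corpus) / df[t]) for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vecs, idf

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(u.get(t, 0.0) * w for t, w in v.items()) / (nu * nv)

doc_vecs, idf = tf_idf(docs)
q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query.split()).items()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(doc_vecs[i], q_vec), reverse=True)
print("documents ranked by relevance:", ranking)
```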
-
Speaker: David Spiegelhalter, MRC Biostatistics Unit, Institute for Public Health, Cambridge | 11:00 AM – 12:00 PM and 1:30 PM – 2:30 PM
The first part of the tutorial will cover the fundamentals of Bayesian inference, including probability and its subjective interpretation, evaluation of probability assessments using scoring rules, utilities and decision theory. The use of Bayes’ theorem for updating beliefs will be illustrated for both binomial and normal likelihoods, and the use of conjugate families of priors and predictive distributions described. The First Bayes software will be used to display conjugate Bayesian analysis. The second part will introduce the concept of ‘exchangeability’, and the consequent use of hierarchical models in which the unknown parameters of a common prior are included in the model. Conditional independence assumptions lead naturally to a graphical representation of hierarchical models. Markov chain Monte Carlo (MCMC) methods will be introduced as a means of carrying out the necessary numerical integrations, and topics covered will include the relationship of Gibbs sampling to graphical modelling, parameterisation, initial values, and choice of prior distributions. Real examples will be used throughout, and on-line analysis of an example in longitudinal modelling with measurement error on predictors will be carried out using the WinBUGS program.
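As a small worked illustration of the conjugate updating covered in the first part (a hypothetical example, not taken from the tutorial or the First Bayes software), the sketch below combines a Beta prior with binomial data and reads off the posterior mean and the predictive probability of a further success.

```python
# Beta-binomial conjugate updating: a Beta(a, b) prior on a success
# probability combined with binomial data gives a Beta posterior, and the
# predictive probability of one more success equals the posterior mean.
a, b = 2.0, 2.0                 # hypothetical Beta prior
successes, failures = 7, 3      # hypothetical binomial observations

a_post, b_post = a + successes, b + failures      # conjugate update
posterior_mean = a_post / (a_post + b_post)       # also P(next trial succeeds | data)

print(f"posterior: Beta({a_post:.0f}, {b_post:.0f})")
print(f"posterior mean / predictive probability of success: {posterior_mean:.3f}")
```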
-
Speaker: Trevor Hastie, Stanford University | 3:30 PM – 5:00 PM
Boosting (Freund and Schapire, 1995) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable for large-scale data mining applications.
* joint work with Jerome Friedman and Rob Tibshirani
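The reweight-and-vote scheme described in the abstract can be made concrete with a schematic discrete AdaBoost loop over one-dimensional decision stumps. The synthetic data, the crude stump learner, and the number of rounds are all illustrative assumptions; this is a sketch of the generic boosting recipe, not the authors' formulation.

```python
# Schematic discrete AdaBoost: repeatedly fit a weak classifier (a decision
# stump) to reweighted data, then classify by a weighted majority vote.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = np.where(X + 0.3 * rng.normal(size=200) > 0, 1, -1)   # labels in {-1, +1}

def fit_stump(X, y, w):
    """Return the (error, threshold, sign) stump with smallest weighted error."""
    best = (np.inf, 0.0, 1)
    for thr in np.unique(X):
        for sign in (1, -1):
            pred = np.where(X > thr, sign, -sign)
            err = w[pred != y].sum()
            if err < best[0]:
                best = (err, thr, sign)
    return best

w = np.full(len(X), 1.0 / len(X))
stumps = []
for _ in range(20):
    err, thr, sign = fit_stump(X, y, w)
    err = min(max(err, 1e-12), 1 - 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)          # weight of this classifier
    pred = np.where(X > thr, sign, -sign)
    w *= np.exp(-alpha * y * pred)                 # reweight the training data
    w /= w.sum()
    stumps.append((alpha, thr, sign))

F = sum(a * np.where(X > t, s, -s) for a, t, s in stumps)  # weighted majority vote
print("training error:", np.mean(np.sign(F) != y))
```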
Abstracts | January 4 – 6
-
Author: Hagai Attias, Gatsby Computational Neuroscience Unit, University College London
Abstract: We introduce a new real-valued belief network, which is a multilayer generalization of independent factor analysis (IFA). At each level, this network extracts real-valued latent variables that are non-linear functions of the input data with a highly adaptive functional form, resulting in a hierarchical distributed representation of these data. The network is based on a probabilistic generative model, constructed by cascading single-layer IFA models. Although exact maximum-likelihood learning for this model is intractable, we present and demonstrate an algorithm that maximizes a lower bound on the likelihood. This algorithm is developed by formulating a variational approach to hierarchical IFA networks.
Availability: PDF
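The bound maximized by the algorithm has the generic variational form shown below (standard notation; the paper's own factorization of the approximating posterior over the hierarchical IFA latent variables is not reproduced here).

```latex
% Generic variational lower bound on the log-likelihood: q(z) is an
% approximating posterior over the latent variables z, \theta the parameters.
\[
  \ln p(x \mid \theta)
    \;\ge\;
    \mathbb{E}_{q(z)}\!\left[\ln p(x, z \mid \theta)\right]
    - \mathbb{E}_{q(z)}\!\left[\ln q(z)\right]
    \;=\;
    \ln p(x \mid \theta)
    - \mathrm{KL}\!\left(q(z) \,\middle\|\, p(z \mid x, \theta)\right).
\]
```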
-
Author: Matthew Brand, Mitsubishi Electric Research Labs
Abstract: We propose a framework for learning hidden-variable models by optimizing entropies, in which entropy minimization, posterior maximization, and free energy minimization are all equivalent. Solutions for the maximum a posteriori (MAP) estimator yield powerful learning algorithms that combine all the charms of expectation-maximization and deterministic annealing. Contained as special cases are the methods of maximum entropy, maximum likelihood, and a new method, maximum structure. We focus on the maximum structure case, in which entropy minimization maximizes the amount of evidence supporting each parameter while minimizing uncertainty in the sufficient statistics and cross-entropy between the model and the data. In iterative estimation, the MAP estimator gradually extinguishes excess parameters, sculpting a model structure that reflects hidden structures in the data. These models are highly resistant to over-fitting and have the particular virtue of being easy to interpret, often yielding insights into the hidden causes that generate the data.
Availability: PDF
-
Author: Louis Anthony (“Tony”) Cox, Jr., Cox Associates
Abstract: Classification trees, usually used as a nonlinear, nonparametric classification method, can also provide a powerful framework for comparing, assessing, and combining information from different expert systems, by treating their predictions as the independent variables in a classification tree analysis. This paper discusses the applied problem of classifying chemicals as human carcinogens. It shows how classification trees can be used to compare the information provided by ten different carcinogen classification expert systems, construct an improved “hybrid” classification system from them, and identify cost-effective combinations of assays (the inputs to the expert systems) to use in classifying chemicals in the future.
Availability: PDF
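As a rough sketch of the setup (with randomly generated placeholders rather than the paper's assays, expert systems, or chemicals), a classification tree can be fit directly on the matrix of expert-system predictions:

```python
# Treat the predictions of several expert systems as the input variables of
# a classification tree; the predictions and labels below are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_chemicals, n_expert_systems = 200, 10
expert_predictions = rng.integers(0, 2, size=(n_chemicals, n_expert_systems))
# Synthetic "ground truth": a majority vote of the first three systems.
label = (expert_predictions[:, :3].sum(axis=1) >= 2).astype(int)

hybrid_tree = DecisionTreeClassifier(max_depth=3).fit(expert_predictions, label)
print("training accuracy of the hybrid tree:", hybrid_tree.score(expert_predictions, label))
```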
-
Authors: A. Philip Dawid, Department of Statistical Science, University College London; Milan Studeny, Institute of Information Theory and Automation, Academy of Sciences of Czech Republic, and Laboratory of Intelligent Systems, University of Economics Prague
Abstract: We introduce a new abstract approach to the study of conditional independence, founded on a concept analogous to the factorization properties of probabilistic independence, rather than the separation properties of a graph. The basic ingredient is the “conditional product”, which provides a way of combining the basic objects under consideration while preserving as much independence as possible. We introduce an appropriate axiom system for conditional product, and show how, when these axioms are obeyed, they induce a derived concept of conditional independence which obeys the usual semi-graphoid axioms. The general structure is used to throw light on three specific areas: the familiar probabilistic framework (both the discrete and the general case); a set-theoretic framework related to “variation independence”; and a variety of graphical frameworks.
Availability: PDF
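For reference, the semi-graphoid axioms mentioned above are the standard ones, written here in common notation (this is textbook material, not the paper's conditional-product axiom system):

```latex
% Semi-graphoid axioms; X \ci Y \mid Z reads "X is independent of Y given Z".
\newcommand{\ci}{\mathrel{\perp\!\!\!\perp}}
\begin{align*}
\text{Symmetry:}      &\quad X \ci Y \mid Z \;\Longrightarrow\; Y \ci X \mid Z \\
\text{Decomposition:} &\quad X \ci (Y, W) \mid Z \;\Longrightarrow\; X \ci Y \mid Z \\
\text{Weak union:}    &\quad X \ci (Y, W) \mid Z \;\Longrightarrow\; X \ci Y \mid (Z, W) \\
\text{Contraction:}   &\quad X \ci Y \mid Z \ \text{ and } \ X \ci W \mid (Z, Y)
                       \;\Longrightarrow\; X \ci (Y, W) \mid Z
\end{align*}
```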
-
Authors: Jan De Geeter and Marc Decréton, SCK.CEN (Belgian Nuclear Research Centre); Joris De Schutter, Herman Bruyninckx, and Hendrik Van Brussel, Department of Mechanical Engineering, Division PMA, Katholieke Universiteit Leuven
Abstract: This paper is about the task-directed updating of an incomplete and inaccurate geometric model of a nuclear environment, using only robust radiation-resistant sensors installed on a robot that is remotely controlled by a human operator. In this problem, there are many sources of uncertainty and ambiguity. This paper proposes a probabilistic solution under Gaussian assumptions. Uncertainty is reduced with an estimator based on a Kalman filter. Ambiguity on the measurement-feature association is resolved by running a bank of those estimators in parallel, one for each plausible association. The residual errors of these estimators are used for hypothesis testing and for the calculation of a probability distribution over the remaining hypotheses. The best next sensing action is calculated as a Bayes decision with respect to a loss function that takes into account both the uncertainty on the current estimate, and the variance/precision required by the task.
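The building block of the estimator bank is a standard Kalman-filter measurement update; the sketch below shows the generic update and the innovation whose normalized magnitude can drive the hypothesis testing over measurement-feature associations. The state, measurement model, and noise covariance are illustrative placeholders, not the paper's sensing model.

```python
# Generic Kalman-filter measurement update for a linear Gaussian model
# z = H x + noise; returns the updated estimate and the innovation statistic.
import numpy as np

def kalman_update(x, P, z, H, R):
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    innovation = z - H @ x
    x_new = x + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P
    # Normalized innovation squared, usable for hypothesis testing across a
    # bank of filters (one per plausible measurement-feature association).
    nis = float(innovation @ np.linalg.inv(S) @ innovation)
    return x_new, P_new, nis

# Toy example: a 2-D geometric parameter observed through a 1-D measurement.
x, P = np.zeros(2), np.eye(2)
H, R = np.array([[1.0, 0.5]]), np.array([[0.1]])
x, P, nis = kalman_update(x, P, np.array([1.2]), H, R)
print(x, nis)
```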
-
Author: Pedro Domingos, Artificial Intelligence Group, Instituto Superior Técnico
Abstract: Methods to avoid overfitting fall into two broad categories: data-oriented (using separate data for validation) and representation-oriented (penalizing complexity in the model). Both have limitations that are hard to overcome. We argue that fully adequate model evaluation is only possible if the search process by which models are obtained is also taken into account. To this end, we recently proposed a method for process-oriented evaluation (POE), and successfully applied it to rule induction (Domingos, 1998). However, for the sake of simplicity this treatment made two rather artificial assumptions. In this paper the assumptions are removed, and a simple formula for model evaluation is obtained. Empirical trials show the new, better-founded form of POE to be as accurate as the previous one, while further reducing theory sizes.
Availability: PDF
-
Authors: Lewis J. Frey and Douglas H. Fisher, Jr., Computer Science Department, Vanderbilt University
Abstract: This paper discusses the use of a power law to predict decision tree performance. Power laws are fit to learning curves of decision trees trained on data sets from the UCI repository. The learning curves are generated by training C4.5 on training sets of different sizes. The power law predicts diminishing returns in terms of error rate as the training set size increases. By characterizing the learning curve with a power law, the error rate for a training set of a given size can be projected. This projection can be used to estimate the amount of data needed to achieve an acceptable error rate, and the cost-effectiveness of further data collection.
Availability: PDF
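A minimal sketch of the fitting step is shown below, with synthetic numbers standing in for the UCI/C4.5 learning curves and a three-parameter power law, error(n) ≈ a * n^(-b) + c, as an assumed functional form (the paper's exact parametrization may differ).

```python
# Fit a power law to a (synthetic) learning curve and project the error
# rate at a larger training-set size.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * np.power(n, -b) + c

n = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
err = np.array([0.310, 0.260, 0.225, 0.200, 0.188, 0.180])   # synthetic error rates

(a, b, c), _ = curve_fit(power_law, n, err, p0=[1.0, 0.5, 0.1])
print(f"fit: error(n) = {a:.2f} * n^(-{b:.2f}) + {c:.3f}")
print("projected error at n = 10000:", round(power_law(10000, a, b, c), 4))
```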
-
Authors: Nir Friedman, Institute of Computer Science, The Hebrew University; Lise Getoor, Computer Science Department, Stanford University
Abstract: Learning Bayesian networks is a central problem for pattern recognition, density estimation and classification. In this paper, we propose a new method for speeding up the computational process of learning Bayesian network structure. This approach uses constraints imposed by the statistics already collected from the data to guide the learning algorithm. This allows us to reduce the number of statistics collected during learning and thus speed up the learning time. We show that our method is capable of learning structure from data more efficiently than traditional approaches. Our technique is of particular importance when the size of the datasets is large or when learning from incomplete data. The basic technique that we introduce is general and can be used to improve learning performance in many settings where sufficient statistics must be computed. In addition, our technique may be useful for alternate search strategies such as branch and bound algorithms.
Availability: PDF
-
Authors: Dan Geiger, David Heckerman, and Christopher Meek, Decision Theory & Adaptive Systems, Microsoft Research; Henry King, Mathematics Department, University of Maryland
Abstract: We prove that many graphical models with hidden variables are not curved exponential families. This result, together with the fact that some graphical models are curved and not linear, implies that the hierarchy of graphical models (linear, curved, and stratified) is non-collapsing; each level in the hierarchy is strictly contained in the larger levels. This result is discussed in the context of model selection of graphical models.
-
Authors: Sujit Ghosh, Department of Statistics, NC State University; Alan E. Gelfand, Department of Statistics, University of Connecticut
Abstract: Model choice is a fundamental activity in the analysis of data sets, an activity that has become increasingly important as computational advances enable the fitting of increasingly complex models. Such complexity typically arises through hierarchical structure, which requires specification at each stage of probabilistic mechanisms, mean and dispersion forms, explanatory variables, etc. Nonnested hierarchical models introducing random effects may not be handled by classical methods. Bayesian approaches using predictive distributions can be used, though the formal solution, which includes Bayes factors as a special case, can be criticized. It seems natural to evaluate model performance by comparing what the model predicts with what has been observed. Most classical criteria utilize such a comparison. We propose a predictive criterion where the goal is good prediction of a replicate of the observed data, tempered by fidelity to the observed values. We obtain this criterion by minimizing posterior loss for a given model and then, among the models under consideration, selecting the one which minimizes this criterion. For a version of log scoring loss we can do the minimization explicitly, obtaining an expression which can be interpreted as a penalized deviance criterion. We illustrate its performance with an application to a large data set involving residential property transactions.
Availability: PDF
-
Authors: Michael Haft, Reimar Hofmann, and Volker Tresp, Siemens AG, Corporate Technology, Information and Communications Department
Abstract: We present a systematic, model-independent formulation of mean field theory (MFT) as an inference method in probabilistic models. “Model-independent” means that we do not assume a particular type of dependency among the variables of a domain but instead work in a general probabilistic setting. In a Bayesian network, for example, arbitrary conditional probability tables may be used, so MFT can be run in any Bayesian network. Furthermore, the general mean field equations derived here shed light on the essence of MFT. MFT can be interpreted as a local iteration scheme which relaxes to a consistent state (a solution of the mean field equations). Iterating the mean field equations means propagating information through the network. In general, however, there are multiple solutions to the mean field equations. We show that improved approximations can be obtained by forming a weighted mixture of the multiple mean field solutions. Simple approximate expressions for the mixture weights are given. The benefits of taking multiple solutions into account are demonstrated by using MFT for inference in a small Bayesian network representing a medical domain, where it turns out that every solution of the mean field equations can be interpreted as a ‘disease scenario’.
Availability: PDF
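The local iteration scheme referred to above has the familiar fixed-point form for a fully factorized approximation (a standard result stated here for orientation; the paper's own derivation and notation are more general):

```latex
% Mean field fixed-point equations for a factorized approximation
% q(x) = \prod_i q_i(x_i) to a target distribution p(x_1, \dots, x_n):
\[
  q_i(x_i) \;\propto\;
  \exp\!\Big( \mathbb{E}_{q_{\setminus i}}\!\big[ \ln p(x_1, \dots, x_n) \big] \Big),
  \qquad i = 1, \dots, n,
\]
% where the expectation is taken over all variables except x_i under the
% current factors; iterating these local updates propagates information
% through the network until a consistent (fixed-point) solution is reached.
```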
-
Authors: Tommi S. Jaakkola, Department of Computer Science and Electrical Engineering, Massachusetts Institute of Technology; David Haussler, Department of Computer Science, University of California–Santa Cruz
Abstract: We introduce a class of flexible conditional probability models and techniques for classification/regression problems. Many existing methods such as generalized linear models and support vector machines are subsumed under this class. The flexibility of this class of techniques comes from the use of kernel functions as in support vector machines, and the generality from dual formulations of standard regression models.
*The work was done while T. Jaakkola was at UC Santa Cruz.
-
Authors: Wenxin Jiang and Martin A. Tanner, Department of Statistics, Northwestern University
Abstract: We investigate a class of hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form $\psi(a+x^T b)$ are mixed. Here $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model with mean function belonging to a class of smooth functions of the form $\psi(h(x))$ where $h \in W_{2;K_0}^\infty$ (a Sobolev class over $[0,1]^{s}$). It is shown that the HME mean functions can approximate the true mean function, at a rate of $O(m^{-2/s})$ in $L_p$ norm. Moreover, the HME probability density functions can approximate the true density, at a rate of $O(m^{-2/s})$ in Hellinger distance, and at a rate of $O(m^{-4/s})$ in Kullback-Leibler divergence. These rates can be achieved within the family of HME structures with a tree of binary splits, or within the family of structures with a single layer of experts. Here $s$ is the dimension of the predictor $x$. It is also shown that likelihood-based inference based on HME is consistent in recovering the truth, in the sense that as the sample size $n$ and the number of experts $m$ both increase, the mean square error of the estimated mean response goes to zero. Conditions for such results to hold are stated and discussed.
Availability: PDF
-
Authors: Kalev Kask and Rina Dechter, Department of Information and Computer Science, University of California–Irvine
Abstract: The paper empirically evaluates the suitability of Stochastic Local Search (SLS) algorithms for finding most probable explanations in Bayesian networks. SLS algorithms (e.g., GSAT, WSAT) have recently proven to be highly effective in solving complex constraint-satisfaction and satisfiability problems which cannot be solved by traditional search schemes. Our experiments investigate the applicability of this scheme to probabilistic optimization problems. Specifically, we show that algorithms combining hill-climbing steps with stochastic steps (guided by the network’s probability distribution), called G+StS, outperform pure hill-climbing search, pure stochastic simulation search, and simulated annealing. In addition, variants of G+StS built on top of alternative approximation methods are shown to be particularly effective.
Availability: PDF
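The combination of greedy and stochastic moves can be pictured with the generic skeleton below, written for an arbitrary scoring function over binary assignments. In G+StS the stochastic steps are guided by the network's probability distribution, whereas this sketch simply flips a random variable, so it is only a schematic of the search strategy rather than the authors' algorithm.

```python
# Generic hill-climbing-plus-stochastic-steps local search over binary
# assignments, maximizing a black-box score function.
import random

def local_search(score, n_vars, steps=1000, p_stochastic=0.3, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    best, best_score = list(x), score(x)
    for _ in range(steps):
        if rng.random() < p_stochastic:          # stochastic step: random flip
            x[rng.randrange(n_vars)] ^= 1
        else:                                     # greedy step: best single flip
            current = score(x)
            gains = []
            for i in range(n_vars):
                x[i] ^= 1
                gains.append((score(x), i))
                x[i] ^= 1
            g, i = max(gains)
            if g > current:
                x[i] ^= 1
        if score(x) > best_score:
            best, best_score = list(x), score(x)
    return best, best_score

# Toy score: agreement with a hidden target assignment.
target = [1, 0, 1, 1, 0, 0, 1, 0]
assignment, value = local_search(lambda x: sum(a == b for a, b in zip(x, target)), len(target))
print(assignment, value)
```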
-
Author: David Madigan, Department of Statistics, University of Washington
Abstract: In clinical trials with significant noncompliance the standard intention-to-treat analyses sometimes mislead. Rubin’s causal model provides an alternative method of analysis that can shed extra light on clinical trial data. Formulating the Rubin Causal Model as a Bayesian graphical model facilitates model communication and computation.
Availability: PDF
-
Authors: Tim Oates, Matthew D. Schmill, Paul R. Cohen, and Casey Durfee, Experimental Knowledge Systems Lab, Department of Computer Science, University of Massachusetts–Amherst
Abstract: The Multi-Stream Dependency Detection (MSDD) algorithm finds rules that capture statistical dependencies between patterns in multivariate time series of categorical data. Rule strength is measured by the G statistic, and an upper bound on the value of G for the descendants of a node allows MSDD’s search space to be pruned. However, in the worst case, the algorithm will explore exponentially many rules. This paper presents and empirically evaluates two ways of addressing this problem. The first is a set of three methods for reducing the size of MSDD’s search space based on information collected during the search process. The second is an implementation of MSDD that distributes its computations over multiple machines on a network.
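For orientation, the G statistic used to score a rule is the standard likelihood-ratio statistic over a contingency table of observed and expected counts; the sketch below computes it for a 2x2 table of made-up co-occurrence counts (how MSDD builds its tables is not reproduced here).

```python
# G (likelihood-ratio) statistic for a contingency table of observed counts:
# G = 2 * sum over cells of O * ln(O / E), with E the independence expectation.
import math

def g_statistic(table):
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_sums[i] * col_sums[j] / total
                g += obs * math.log(obs / expected)
    return 2.0 * g

print(g_statistic([[30, 10], [15, 45]]))   # illustrative counts only
```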
-
Authors: Thomas Richardson, Heiko Bailer, and Moulinath Banerjees, Department of Statistics, University of Washington
Abstract: The problem of learning the structure of a DAG model in the presence of latent variables presents many formidable challenges. In particular there are an infinite number of latent variable models to consider, and these models possess features which make them hard to work with. We describe a class of graphical models which can represent the conditional independence structure induced by a latent variable model over the observed margin. We give a parametrization of the set of Gaussian distributions with conditional independence structure given by a MAG model. The models are illustrated via a simple example. Different estimation techniques are discussed in the context of Zellner’s Seemingly Unrelated Regression (SUR) models.
-
Authors: Greg Ridgeway, David Madigan, and Thomas Richardson, Department of Statistics, University of Washington
Abstract: Classification problems have dominated research on boosting to date. The application of boosting to regression problems, on the other hand, has received little investigation. In this paper we develop a new boosting method for regression problems. We cast the regression problem as a classification problem and apply an interpretable form of the boosted naïve Bayes classifier. This induces a regression model that we show to be expressible as an additive model for which we derive estimators and discuss computational issues. We compare the performance of our boosted naïve Bayes regression model with other interpretable multivariate regression procedures.
Availability: PDF
-
Authors: Peter Spirtes, Department of Philosophy, Carnegie Mellon University; Greg Cooper, Center for Biomedical Informatics, University of Pittsburgh
Abstract: We tested a causal discovery algorithm on a database of pneumonia patients. The output of the causal discovery algorithm is a list of statements of the form “A causes B”, where A and B are variables in the database, together with a score indicating the degree of confidence in each statement. We compared the output of the algorithm with the opinions of physicians about whether A caused B or not. We found that the physicians’ opinions were independent of the output of the algorithm. However, an examination of the results suggested a simple, well-motivated modification of the algorithm that would bring its output into high agreement with the physicians’ opinions.
Availability: PDF
-
Authors: Murlikrishna Viswanathan and Professor Chris Wallace, School of Computer Science and Software Engineering, Monash University–Clayton
Abstract: Minimum Message Length (MML) and Structural Risk Minimisation (SRM) are two computational learning principles that have achieved wide acclaim in recent years. Whereas the former is based on Bayesian learning and the latter on the classical theory of VC-dimension, they are similar in their attempt to define a trade-off between model complexity and goodness of fit to the data. A recent empirical study by Wallace compared the performance of standard model selection methods in a one-dimensional polynomial regression framework. The results from this study provided strong evidence in support of the MML- and SRM-based methods over the other standard approaches. In this paper we present a detailed empirical evaluation of three model selection methods: an MML-based approach and two SRM-based methods. Results from our analysis and experimental evaluation suggest that the MML-based approach generally has higher predictive accuracy, and also raise questions about the inductive capabilities of the Structural Risk Minimisation principle.