Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments
- Sugato Basu | University of Texas, Austin
Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. Our research focuses on semi-supervised clustering, where we study how prior knowledge can be incorporated into clustering algorithms. We present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on datasets from different domains (e.g., text and web data, handwritten character recognition, and bioinformatics).
In many domains where clustering is applied, prior knowledge is naturally available as constraints on some of the instances, specifying whether two instances should be in the same cluster or in different clusters. We focus in particular on the problem of semi-supervised clustering with such pairwise constraints. We show that this problem has a well-defined underlying probabilistic model, a Hidden Markov Random Field, and we give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).
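To make the idea of clustering under pairwise constraints concrete, here is a minimal sketch of a penalized objective and one greedy reassignment sweep. It is an illustration under stated assumptions, not the algorithm presented in the talk: it assumes squared Euclidean distortion, a single uniform penalty `w` for every violated constraint, and fixed centroids. The function names `constrained_objective` and `reassign` are hypothetical.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distortion between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def constrained_objective(points, labels, centroids, must_link, cannot_link, w=1.0):
    """Distortion to assigned centroids plus a penalty per violated constraint.

    must_link / cannot_link are lists of index pairs (i, j).
    The uniform penalty w is an assumption made for this sketch.
    """
    distortion = sum(sq_euclidean(p, centroids[l]) for p, l in zip(points, labels))
    ml_penalty = sum(w for i, j in must_link if labels[i] != labels[j])
    cl_penalty = sum(w for i, j in cannot_link if labels[i] == labels[j])
    return distortion + ml_penalty + cl_penalty


def reassign(points, labels, centroids, must_link, cannot_link, w=1.0):
    """One greedy sweep: move each point to the cluster with the lowest
    distortion-plus-penalty cost, holding the other labels fixed."""
    labels = list(labels)
    for i, p in enumerate(points):
        def cost(k):
            c = sq_euclidean(p, centroids[k])
            for a, b in must_link:
                if i in (a, b):
                    other = b if a == i else a
                    if labels[other] != k:  # must-link would be violated
                        c += w
            for a, b in cannot_link:
                if i in (a, b):
                    other = b if a == i else a
                    if labels[other] == k:  # cannot-link would be violated
                        c += w
            return c
        labels[i] = min(range(len(centroids)), key=cost)
    return labels


points = [(0.0, 0.0), (4.0, 0.0), (2.5, 0.0)]
centroids = [(0.0, 0.0), (4.0, 0.0)]
unconstrained = reassign(points, [0, 1, 1], centroids, [], [])
constrained = reassign(points, [0, 1, 1], centroids, [(0, 2)], [], w=10.0)
```

In this toy run the third point is closer to the second centroid, but a must-link constraint with the first point outweighs the extra distortion and pulls it into the first cluster. In a full procedure this sweep would alternate with centroid re-estimation, and the distortion function could be swapped for another measure.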
We propose an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for constrained semi-supervised clustering.
Apart from constrained clustering, we will also discuss other interesting problems of semi-supervised clustering in this talk (e.g., using prior knowledge in the form of category labels on data instances during clustering, incorporating prior knowledge into overlapping clustering of data, semi-supervised graph partitioning using a kernel approach).
Speaker Details
Sugato Basu is a PhD student in the Computer Science Department of the University of Texas (UT) at Austin. His PhD advisor is Prof. Raymond J. Mooney, and his research interests are in Machine Learning and Data Mining. Before this, he received his MS from the Computer Engineering department of the University of California at Santa Cruz, and his BTech (Honors) in Computer Science and Engineering from the Indian Institute of Technology at Kharagpur. He received the Best Research Paper Award at KDD 2004, the IBM PhD Fellowship in 2002, and the MCD Fellowship from the University of Texas in 2000. He is on the Program Committees of KDD and AAAI in 2005 and is a reviewer for multiple journals, including the Journal of Machine Learning Research, Pattern Recognition Letters, Journal of Data Mining and Knowledge Discovery, and IEEE Transactions on Pattern Analysis and Machine Intelligence. Two patents have been filed based on his summer research at Google.