Thursday, December 1, 2005
4:15 pm
B17 Upson Hall

Computer Science
Fall 2005

Inderjit Dhillon
University of Texas at Austin

Co-clustering, Matrix Approximations and Bregman Divergences

Many applications in data mining and machine learning require the analysis of large two-dimensional matrices. Depending on the application, these matrices can have very different characteristics: in text mining, co-occurrence matrices are large, sparse and non-negative while DNA microarray analysis yields smaller, dense matrices with positive as well as negative entries. In data analysis, it is often desirable (a) to find "low-parameter" matrix approximations, and (b) to "co-cluster" such matrices, i.e., simultaneously cluster rows as well as columns.

In this talk, I will discuss a framework that inextricably links a certain class of matrix approximations with co-clustering. The approximation error can be measured using a non-trivial class of distortion measures, called Bregman divergences, that have connections to the exponential family of probability distributions. Our algorithms allow us to handle a wide variety of matrices and distortion measures within a unifying framework, and are able to efficiently construct good co-clusterings and the corresponding low-parameter matrix approximations. I will conclude by presenting experimental results on text and microarray data. Extensions to tensors will be briefly discussed.