Speaker: Moses Charikar

Affiliation: Stanford

Host: Jon Kleinberg

Date: 2/10/00

Time & Location: 4:15PM, 101 Phillips

Title: Algorithms for Clustering

Clustering problems arise in various contexts including classification, information retrieval and data mining. Roughly speaking, clustering refers to partitioning a set of objects into groups of similar objects. The objects (e.g. documents or images) are usually represented as points in some space with a distance measure and the objective is to obtain clusters of points that are close to each other. Problems of this flavor also occur in discrete location theory, where the goal is to locate a set of facilities (e.g. factories or warehouses) so as to serve a given set of clients. This talk describes several algorithmic aspects of clustering problems.

The first part of the talk will focus on the approximability of several natural clustering objectives. The quality of a clustering is typically measured by an objective function and the goal of an algorithm is to minimize this. For most natural objective functions, the corresponding optimization problem turns out to be NP-hard. In the face of this fundamental intractability, researchers have shifted their focus from exact solutions to obtaining approximate solutions that are guaranteed to be close to the optimal. I will describe approximation algorithms for classical clustering and location problems such as k-median and facility location. Both linear programming based approaches as well as purely combinatorial approaches such as local search can be used to obtain constant factor approximations for these problems. Further, the ideas behind these algorithms extend to other problems such as min-sum clustering and clustering with excluded outliers.

In the second part of the talk, I will examine other algorithmic issues that arise in clustering problems. I will describe min-wise independent permutations that play an important role in efficiently estimating similarity of documents. They are used in clustering documents to eliminate near duplicates in the AltaVista search index. I will also formulate problems and describe results related to the incremental maintenance of clusters in a dynamic point set.