Compressed Counting and Its Application in Estimating Entropy of Data Streams

 

 

Ping Li
Cornell University, Department of Statistical Science
 

Monday, September 1, 2008
4:00 PM, 5130 Upson Hall

 

Abstract: Numerous modern applications generate high-speed data streams, for example, search engines, online stores, network traffic, financial data, etc. Developing scalable algorithms for data stream computations has recently been identified as one of the 10 challenging problems in data mining research. Based on the idea of maximally-skewed stable random projections, Compressed Counting (CC) is a very efficient algorithm for counting (estimating) the p-th frequency moments of data streams. When p = 1, it is known that one can compute the first moment (sum) trivially using a single counter. When p is close to 1, one may speculate that there should exist an intelligent counting system that works almost like a single counter. CC is the first such intelligent counting system; previous algorithms, including the well-known method of symmetric stable random projections (Indyk 2006, Li 2008), still require a large storage space even at p = 1. In fact, CC reduces the previously known large-deviation bound from O(1/ε²) to O(1/ε) as p tends to 1. In this sense, the improvement of CC over symmetric stable random projections approaches infinity as p tends to 1.
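To make the projection mechanism concrete, the following is a minimal, illustrative sketch of frequency-moment estimation via stable random projections, shown for the symmetric case (Indyk 2006, Li 2008) with a simple median estimator. CC replaces the symmetric (beta = 0) stable entries with maximally-skewed (beta = 1) ones and uses the estimators developed in the CC papers; the toy stream, sketch size, and variable names below are illustrative assumptions, not the speaker's implementation.

```python
# Sketching the p-th frequency moment F_p = sum_i f_i^p with stable
# random projections (symmetric case shown; CC uses beta = 1 entries).
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
n, k, p = 1000, 50, 1.05          # item-universe size, sketch size, moment order

# Projection matrix of i.i.d. symmetric p-stable entries (beta = 0).
R = levy_stable.rvs(alpha=p, beta=0.0, size=(n, k), random_state=rng)

sketch = np.zeros(k)              # the only state kept while the stream passes

def update(i, delta):
    """Process one stream increment: item i arrives with count delta."""
    sketch[:] += delta * R[i]     # O(k) work per increment

# Feed a toy stream (item id, increment) and track exact counts for comparison.
counts = np.zeros(n)
for i, delta in zip(rng.integers(0, n, 10000), rng.integers(1, 5, 10000)):
    update(i, delta)
    counts[i] += delta

# Median estimator for the symmetric case: each sketch coordinate is
# distributed as F_p^(1/p) times a standard symmetric p-stable variable.
scale_med = np.median(np.abs(levy_stable.rvs(alpha=p, beta=0.0,
                                             size=100000, random_state=rng)))
F_p_hat = (np.median(np.abs(sketch)) / scale_med) ** p

print("estimated F_p:", F_p_hat)
print("exact F_p:    ", np.sum(counts ** p))
```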

One application of CC is estimating the entropy of data streams. For example, a FOCS 2008 paper suggested using the p-th moment with p = 1 + δ to approximate Shannon entropy, where |δ| < 0.0001. After presenting the theory of CC, this talk will demonstrate the bias-variance trade-off in estimating Shannon entropy and present some empirical results on real Web crawl data (a small numerical sketch of the moment-to-entropy approximation follows the list below). Two important conclusions can be drawn:

  1. Using CC with p = 1 + δ provides a highly efficient algorithm for estimating entropy.
  2. Using symmetric stable random projections (Indyk 2006, Li 2008) with p = 1 + δ is impractical because the required storage space (sample size) is disastrously large; instead, one should exploit the bias-variance trade-off by using p farther away from 1.
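The sketch below illustrates how a p-th frequency moment with p = 1 + δ translates into an entropy approximation, via the Rényi and Tsallis entropy functionals, both of which converge to Shannon entropy as δ tends to 0. It is a hedged, self-contained illustration: the toy frequency vector is an assumption, and in the streaming setting F_p would come from a CC estimate rather than being computed exactly.

```python
# Approximating Shannon entropy from a frequency moment F_p = sum_i f_i^p
# with p = 1 + delta, using the Renyi and Tsallis entropy functionals.
import numpy as np

f = np.array([500., 200., 100., 50., 10., 5., 1.])   # toy item frequencies
F1 = f.sum()
shannon = -np.sum((f / F1) * np.log(f / F1))          # exact Shannon entropy

for delta in (0.1, 0.01, 0.001, 0.0001):
    p = 1.0 + delta
    Fp = np.sum(f ** p)
    renyi   = np.log(Fp / F1 ** p) / (1.0 - p)        # Renyi entropy H_p
    tsallis = (1.0 - Fp / F1 ** p) / (p - 1.0)        # Tsallis entropy T_p
    print(f"delta={delta:7.4f}  Renyi={renyi:.4f}  "
          f"Tsallis={tsallis:.4f}  Shannon={shannon:.4f}")

# As delta shrinks, the bias of either approximation vanishes, but the
# variance (and, for symmetric projections, the required sample size) of the
# moment estimate grows: the bias-variance trade-off discussed in the talk.
```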