Popularity and Zipf-Parameter Estimation

Next: Replication Protocol Up: The Beehive System Previous: Analytical Model

Popularity and Zipf-Parameter Estimation

The analytical model described in the previous section requires estimates of the Zipf parameter of the query distribution and the relative popularities of the objects. Beehive employs a combination of local measurement and limited aggregation to keep track of these parameters and adapt the replication appropriately.

Beehive nodes locally measure the number of queries received for each object on that node. For non-replicated objects, the count at the home node reflects the popularity of that object. However, queries for an object replicated at level i are evenly distributed across approximately N/bⁱ nodes. In order to estimate popularity of such an object as accurately as an unreplicated object, one would need an N/bⁱ-fold increase in the measurement interval. Since dilating the sampling interval would prevent the system from reacting quickly to changes in object popularity, Beehive aggregates popularity data from multiple nodes to arrive at accurate estimates of object popularity within a relatively short sampling interval.

Aggregation in Beehive takes place periodically, once every aggregation interval. Each node A sends to node B in the i^th level of its routing table an aggregation message containing the number of accesses of each object replicated at level i or lower and having i+1 matching prefixes with B. When node B receives these messages from all nodes at level i, it aggregates the access counts and sends the result to all nodes in the (i+1)^th level of its routing table. This process allows popularity data to flow towards the home node.

A similar process operates in reverse to disseminate the aggregated totals from the higher levels towards the nodes at the leaves. A node at level i+1 sends the latest aggregated estimate of access counts to nodes at level i for that object. This exchange occurs when the higher level node is contacted by lower level node for aggregation. For an object replicated at level i, it takes 2(logN-i) rounds of aggregation to complete the information flow from the leaves up to the home node and back.

The overall Zipf-parameter is also estimated in a similar manner. Each node locally estimates the Zipf-parameter using linear regression to compute the slope of the best fit line, since a Zipf-like popularity distribution is a straight line in log-scale. Beehive nodes then refine this estimate by averaging it with the local estimates of other nodes they communicate with during aggregation.

There will be fluctuations in the estimation of access frequency and the Zipf parameter due to temporal variations in the query distribution. In order to avoid large discontinuous changes to an estimate, we age the estimate using exponential decay.

Next: Replication Protocol Up: The Beehive System Previous: Analytical Model

beehive-l@cs.cornell.edu 2004-02-11