Analytical Model

In this section, we provide a model that analyzes Zipf-like query distributions and provides closed-form optimal replication levels for the objects in order to achieve constant average lookup performance with low storage and bandwidth overhead.

In Zipf-like, or power law, query distributions, the number of queries to the $i^{th}$ most popular object is proportional to $i^{-\alpha}$ , where $\alpha$ is the parameter of the distribution. The query distribution has a heavier tail for smaller values of the parameter $\alpha$ . A Zipf distribution with parameter 0 corresponds to a uniform distribution. The total number of queries to the most popular

objects,

, is approximately $\frac{m^{1-\alpha} - 1}{1 - \alpha}$ for $\alpha \neq 1$ , and $Q(m) \simeq ln(m)$ for $\alpha = 1$ .

Using the above estimate for the number of queries received by objects, we pose an optimization problem to minimize the total number of replicas with the constraint that the average lookup latency is a constant

Let

be the base of the underlying DHT system,

the number of objects, and

the number of nodes in the system. Initially, all

objects in the system are stored only at their home nodes, that is, they are replicated at level

. Let

denote the fraction of objects replicated at level

or lower. From this definition,

, since all objects are replicated at level

most popular objects are replicated at all the nodes in the system.

Each object replicated at level

is cached in

nodes. $Mx_i - Mx_{i-1}$ objects are replicated on nodes that have exactly

matching prefixes. Therefore, the average number of objects replicated at each node is given by $Mx_0 + \frac{M(x_1 - x_0)}{b} + \cdots + \frac{M(x_k - x_{k-1})}{b^k}$ . Consequently, the average per node storage requirement for replication is:

$\displaystyle M [(1 - \frac{1}{b})(x_0 + \frac{x_1}{b} + \cdots + \frac{x_{k-1}}{b^{k-1}}) + \frac{1}{b^k}]$

(1)

The fraction of queries,

, that arrive for the most popular

objects is approximately $\frac{(Mx_i)^{1-\alpha} - 1}{M^{1-\alpha} - 1}$ . The number of objects that are replicated at level

is $Mx_i - Mx_{i-1}, 0 < i \leq k$ . Therefore, the number of queries that travel

hops is $Q(Mx_i) - Q(Mx_{i-1}), 0 < i \leq k$ . The average lookup latency of the entire system can be given by $\sum_{i = 1}^{k} i (Q(Mx_i) - Q(Mx_{i-1}))$ . The constraint on the average latency is $\sum_{i = 1}^{k} i (Q(Mx_i) - Q(Mx_{i-1})) \leq C$ , where

is the required constant lookup performance. After substituting the approximation for

, we arrive at the following optimization problem.

$\displaystyle \mbox{Minimize } x_0 + \frac{x_1}{b} + \cdots + \frac{x_{k-1}}{b^{k-1}} \mbox{, such that}$			(2)
$\displaystyle x_0^{1 - \alpha} + x_1^{1 - \alpha} + \cdots + x_{k-1}^{1 - \alpha} \geq k - C (1 - \frac{1}{M^{1 - \alpha}})$			(3)
$\displaystyle \mbox{and }x_0 \leq x_1 \leq \cdots \leq x_{k-1} \leq 1$			(4)

Note that constraint 4 effectively reduces to $x_{k-1} \leq 1$ , since any optimal solution to the problem with just constraint 3 would satisfy $x_0 \leq x_1 \leq \cdots \leq x_{k-1}$ .

We can use the Lagrange multiplier technique to find an analytical closed-form optimal solution to the above problem with just constraint 3, since it defines a convex feasible space. However, the resulting solution may not guarantee the second constraint, $x_{k-1} \leq 1$ . If the obtained solution violates the second constraint, we can force $x_{k-1}$ to 1 and apply the Lagrange multiplier technique to the modified problem. We can obtain the optimal solution by repeating this process iteratively until the second constraint is satisfied. However, the symmetric property of the first constraint facilitates an easier analytical approach to solve the optimization problem without iterations.

Assume that in the optimal solution to the problem, $x_0 \leq x_1 \leq \cdots \leq x_{k^\prime-1} < 1$ , for some $k^\prime \leq k$ , and $x_{k^\prime} = x_{k^\prime+1} = \cdots = x_{k} = 1$ . Then we can restate the optimization problem as follows:

$\displaystyle \mbox{Minimize } x_0 + \frac{x_1}{b} + \cdots + \frac{x_{k^\prime-1}}{b^{k^\prime-1}} \mbox{, such that}$			(5)
$\displaystyle x_0^{1 - \alpha} + x_1^{1 - \alpha} + \cdots + x_{k^\prime-1}^{1 - \alpha} \geq k^\prime - C^\prime,$			(6)
$\displaystyle \mbox{where } C^\prime = C(1 - \frac{1}{M^{1 - \alpha}})$

$\displaystyle x^*_i = [\frac{d^i(k^\prime - C^\prime)}{1 + d + \cdots + d^{k^\prime-1}}]^{\frac{1}{1-\alpha}}, \forall 0 \leq i < k^\prime$			(7)
$\displaystyle x^*_i = 1, \forall k^\prime \leq i \leq k$			(8)
$\displaystyle \mbox{where } d = b^{\frac{1-\alpha}{\alpha}}$

We can derive the value of k' by satisfying the condition that $x_{k^\prime-1} < 1$ , that is, $\frac{d^{k^\prime-1}(k^\prime - C^\prime)}{1 + d + \cdots + d^{k^\prime-1}} < 1$ .

As an example, consider a DHT with base 32, Zipf parameter 0.9, 10,000 nodes, and 1,000,000 objects. Applying this analytical method to achieve an average lookup time of one hop yields k' = 2, x₀ = 0.001102, x₁ = 0.0519, and x₂ = 1. Thus, the most popular 1102 objects would be replicated at level 0, the next most popular 50814 objects would be replicated at level 1, and all the remaining objects at level 2. The average per node storage requirement of this system would be 3700 objects.

The optimal solution obtained by this model applies only to the case $\alpha < 1$ . For $\alpha > 1$ , the closed-form solution will yield a level of replication that will achieve the target lookup performance, but the amount of replication may not be optimal because the feasible space is no longer convex. For $\alpha = 1$ , we can obtain the optimal solution by using the approximation $Q(m) = \ln m$ and applying the same technique. The optimal solution for this case is as follows:

$\displaystyle x^*_i = \frac{M^{\frac{-C}{k^\prime}}b^i}{b^{\frac{k^\prime-1}{2}}}, \forall 0 \leq i < k^\prime$			(9)
$\displaystyle x^*_i = 1, \forall k^\prime \leq i \leq k$			(10)

This analytical solution has three properties that are useful for guiding the extent of proactive replication. First, the analytical model provides a solution to achieve any desired constant lookup performance. The system can be tailored, and the amount of overall replication controlled, for any level of performance by adjusting C over a continuous range. Since structured DHTs preferentially keep physically nearby hosts in their top-level routing tables, even a large value for C can dramatically speed up end-to-end query latencies [4]. Second, for a large class of query distributions ( $\alpha \leq 1$ ), the solution provided by this model achieves the optimal number of object replicas required to provide the desired performance. Minimizing the number of replicas reduces per-node storage requirements, bandwidth consumption and aggregate load on the network. Finally, k' serves as an upper bound for the worst case lookup time for any successful query, since all objects are replicated at least in level k'.

We make two assumptions in the analytical model: all objects incur similar costs for replication, and objects do not change very frequently. For applications such as DNS, which have essentially homogeneous object sizes and whose update-driven traffic is a very small fraction of the replication overhead, the analytical model provides an efficient solution. Applying the Beehive approach to applications such as the web, which has a wide range of object sizes and frequent object updates, may require an extension of the model to incorporate size and update frequency information for each object. A simple solution to standardize object sizes is to replicate object pointers instead of the objects themselves. While effective and optimal, this approach adds an extra hop to C and violates sub-one hop routing.