CS 3110 Lecture 21
Amortized analysis and dynamic tables

The claim that hash tables have O(1) performance for lookup and insert is based on the assumption that the number of elements stored in the table is comparable to the number of buckets. If a hash table has many more elements than buckets, the number of elements stored at each bucket becomes large. For instance, with a constant number of buckets and O(n) elements, the lookup time is O(n), not O(1).

The solution to this problem is relatively simple: the array must be increased in size as the number of elements in the hash table increases. However, in doing so, all the elements must be rehashed into the new buckets; thus growing a hash table is not constant time, but rather takes time linear in the number of elements at the time the table is grown. If we define the load factor as the ratio of the number of elements to the number of buckets, then generally the table is grown by a multiplicative factor (e.g., doubled) whenever the load factor exceeds some small constant, such as 2.

The linear running time of a resizing operation isn't as much of a problem as it might sound, though it can be an issue for some real-time computing systems. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on average, even though some insertions trigger a lengthy rehashing of all the elements of the hash table.

Notice that it is crucial that the array size grows geometrically (doubling). It might be tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this results in linear rather than constant amortized running time per insertion.
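To make the growth rule concrete, here is a minimal OCaml sketch of a hash table that doubles its bucket array when the load factor exceeds 2. This is only a sketch under simplifying assumptions (no removal, duplicate keys not handled), and the names tbl, rehash, and load_factor are ours, not from the standard library.

(* Each bucket is an association list of (key, value) pairs. *)
type ('k, 'v) tbl = {
  mutable buckets : ('k * 'v) list array;
  mutable size : int;  (* number of elements stored *)
}

let create () = { buckets = Array.make 8 []; size = 0 }

let load_factor t =
  float_of_int t.size /. float_of_int (Array.length t.buckets)

(* Rehash every element into a bucket array twice as large.  This
   takes time linear in the number of elements, which is why growing
   the table is not a constant-time operation. *)
let rehash t =
  let old = t.buckets in
  t.buckets <- Array.make (2 * Array.length old) [];
  Array.iter
    (List.iter (fun (k, v) ->
         let i = Hashtbl.hash k mod Array.length t.buckets in
         t.buckets.(i) <- (k, v) :: t.buckets.(i)))
    old

let insert t k v =
  if load_factor t > 2.0 then rehash t;
  let i = Hashtbl.hash k mod Array.length t.buckets in
  t.buckets.(i) <- (k, v) :: t.buckets.(i);
  t.size <- t.size + 1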

Now we turn to a more detailed description of amortized analysis.

Amortized analysis

Amortized analysis is a worst-case analysis of a sequence of operations, used to obtain a tighter bound on the overall or average cost per operation in the sequence than is obtained by analyzing each operation separately. For instance, when we considered the union and find operations for the disjoint set data abstraction earlier in the semester, we were able to bound the running time of individual operations by O(lg n). However, for a sequence of n operations it is possible to obtain a bound tighter than O(n lg n) (although that analysis is more appropriate to 4820 than to this course). Here we will consider a simplified version of the hash table problem above, and show that a sequence of n insert operations has overall time O(n).

There are three techniques used for amortized analysis: the aggregate method, the accounting method, and the potential method. We begin with the aggregate method.

Consider a table that can store an arbitrary number of integers. For simplicity, let each insertion operation insert at the end of the table. If there are no empty cells left at the end of the table, then a new table of double the size is created, and all the data from the old table is copied to the corresponding entries in the new table. For instance, consider the following sequence of insertions:

           +--+
Insert 11  |11|
           +--+
           +--+--+
Insert 12  |11|12|
           +--+--+
           +--+--+--+--+
Insert 13  |11|12|13|  |
           +--+--+--+--+
           +--+--+--+--+
Insert 14  |11|12|13|14|
           +--+--+--+--+
           +--+--+--+--+--+--+--+--+
Insert 15  |11|12|13|14|15|  |  |  |
           +--+--+--+--+--+--+--+--+
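In OCaml, this dynamic table might be sketched as follows. This assumes a table of ints with 0 as a dummy value for empty cells; the names grow and add are ours.

type table = {
  mutable arr : int array;
  mutable n : int;  (* number of cells in use *)
}

let create () = { arr = Array.make 1 0; n = 0 }

(* Allocate an array of double the size and copy the n existing
   elements into it: O(n) work. *)
let grow t =
  let bigger = Array.make (2 * Array.length t.arr) 0 in
  Array.blit t.arr 0 bigger 0 t.n;
  t.arr <- bigger

(* Insert at the end of the table, growing first if it is full. *)
let add t x =
  if t.n = Array.length t.arr then grow t;
  t.arr.(t.n) <- x;
  t.n <- t.n + 1

Evaluating let t = create () in List.iter (add t) [11; 12; 13; 14; 15] reproduces the picture above, with the array doubling from size 1 to 2, 2 to 4, and 4 to 8.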

As each insertion takes O(n) time in the worst case, a simple analysis yields a bound of O(n²) time for n insertions. Let's look at an aggregate amortized analysis.

Let c_i be the cost of the i-th insertion:

c_i = i  if i-1 is a power of 2
      1  otherwise

Let's consider the size of the table s_i and the cost c_i for the first few insertions in a sequence:

i     1  2  3  4  5  6  7  8  9 10
s_i   1  2  4  4  8  8  8  8 16 16
c_i   1  2  3  1  5  1  1  1  9  1

Alternatively, we can see that c_i = 1 + d_i, where d_i is the cost of doubling the table at the i-th insertion. That is,

d_i = i-1  if i-1 is a power of 2
      0    otherwise

Then, summing over the entire sequence, all the 1's sum to O(n), and all the d_i also sum to O(n). That is,
Σ_{i=1}^{n} c_i  ≤  n + Σ_{j=0}^{m} 2^j

where m = ⌊lg(n-1)⌋. The geometric series sums to 2^{m+1} - 1 ≤ 2(n-1), so both terms on the right-hand side of the inequality are O(n), and the total running time of n insertions is O(n).
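As a concrete check of the aggregate bound, the following OCaml snippet (function names ours) sums the per-insertion costs c_i and compares the total to 3n:

let is_power_of_two k = k > 0 && k land (k - 1) = 0

(* The cost c_i of the i-th insertion, as defined above. *)
let cost i = if is_power_of_two (i - 1) then i else 1

let total n =
  let rec go i acc = if i > n then acc else go (i + 1) (acc + cost i) in
  go 1 0

let () =
  List.iter
    (fun n -> Printf.printf "n = %4d   total = %5d   3n = %5d\n" n (total n) (3 * n))
    [10; 100; 1000]

For n = 10 this prints a total of 25, under 3n = 30, and the totals stay under 3n for the larger values of n as well.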

Accounting method. In contrast with the aggregate method, which directly seeks a bound on the overall running time of an operation sequence, the accounting method seeks to find a "payment" for each individual operation, such that the sum of the payments for the operations performed is always at least as large as the total actual cost of those operations. Intuitively, one can think of maintaining a bank account: the charge for each operation is deposited into the account, and the actual cost of performing each operation is subtracted from it. The charges must be set just large enough that the balance always remains nonnegative.

If we let c'_i be the charge for the i-th operation, then

Σ_{i=1}^{n} c_i  ≤  Σ_{i=1}^{n} c'_i

for all values of n.

Back to the example of the dynamic table: clearly a charge of 1 per insertion is not enough to cover the actual insertion costs, because even the first doubling of the array costs 2. A charge of 2 per insertion is again not enough, but a charge of 3 appears to be:

i      1  2  3  4  5  6  7  8  9 10
s_i    1  2  4  4  8  8  8  8 16 16
c_i    1  2  3  1  5  1  1  1  9  1
c'_i   3  3  3  3  3  3  3  3  3  3
b_i    2  3  3  5  3  5  7  9  3  5

where b_i is the balance after the i-th insertion.

In fact we can see that this is enough in general, as each insertion can be charged three units: 1 to insert it immediately in the table and 2 to cover the cost of copying when the table next needs to be grown. To see that a copying charge of 2 per insertion suffices, note that when a copy happens there are twice as many elements to copy as have been added since the last time the table grew (because the table doubles in size). Thus the n/2 elements added since the last doubling have deposited 2 units each, which pays for copying all n elements.
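The following OCaml check simulates the bank account and confirms that with a charge of 3 per insertion the balance never goes negative. The names are ours, and cost restates c_i so the snippet is self-contained.

(* c_i, restated: i if i-1 is a power of 2, else 1. *)
let cost i = if i > 1 && (i - 1) land (i - 2) = 0 then i else 1

(* Deposit 3 per insertion, withdraw the actual cost, and check that
   the balance stays nonnegative throughout. *)
let balance_ok n =
  let rec go i bal =
    if i > n then true
    else
      let bal' = bal + 3 - cost i in
      bal' >= 0 && go (i + 1) bal'
  in
  go 1 0

let () = assert (balance_ok 1_000_000)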

In fact we can do slightly better, by charging just 1 for the first insertion and then 3 for each insertion after that, because for the first insertion there are no elements to copy. This yields a zero balance after the first insertion and a positive one thereafter. However, it does not change the asymptotic running time for n insertions, which is O(n) because there is a constant charge of at most 3 per insertion.

For the third technique, the potential method (or physicist's method), see the notes for the next recitation.