Analysis of Closed Hashing - Data Structures and Algorithms Alfred V Aho pdf

In a closed hashing scheme the speed of insertion and other operations depends not only on how randomly the hash function distributes the elements into buckets, but also on how well the rehashing strategy avoids additional collisions when a bucket is already filled. For example, the linear strategy for resolving collisions is not as good as possible. While the analysis is beyond the scope of this book, we can observe the following. As soon as a few consecutive buckets are filled, any key that hashes to one of them will be sent, by the rehash strategy, to the end of the group, thereby increasing the length of that consecutive group. We are thus likely to find more long runs of consecutive filled buckets than if elements filled buckets at random. Moreover, runs of filled buckets cause some long sequences of tries before a newly inserted element finds an empty bucket, so having unusually large runs of filled buckets slows down insertion and other operations.
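The growth of a run under linear rehashing can be illustrated with a small sketch. The function name `linear_insert`, the table size of 10, and the choice of filled buckets 3, 4, and 5 are assumptions made for illustration only; they do not come from the text.

```python
def linear_insert(table, h):
    """Insert into a closed hash table using linear rehashing.

    table is a list of booleans (True = filled); h is the initial bucket.
    Returns the bucket that was filled and the number of probes used.
    """
    B = len(table)
    for i in range(B):
        b = (h + i) % B
        if not table[b]:
            table[b] = True
            return b, i + 1
    raise RuntimeError("table is full")


B = 10
table = [False] * B
for b in (3, 4, 5):          # a run of three consecutive filled buckets
    table[b] = True

# For every possible initial bucket, record where the new element lands
# (each trial works on a fresh copy of the table).
landing = {h: linear_insert(list(table), h)[0] for h in range(B)}

# The run 3..5 gets longer whenever bucket 2 or bucket 6 is filled.
grows_run = [h for h, b in landing.items() if b in (2, 6)]
print(landing)
print(len(grows_run), "of", B, "initial buckets lengthen the run")
```

In this setup five of the ten initial buckets lengthen the run, whereas only two of the seven empty buckets are adjacent to it, which is the bias toward long runs that the paragraph describes.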

We might wish to know how many tries (or probes) are necessary on the average to insert an element when N out of B buckets are filled, assuming all combinations of N out of B buckets are equally likely to be filled. It is generally assumed, although not proved, that no closed hashing strategy can give a better average time performance than this for the dictionary operations. We shall then derive a formula for the cost of insertion if the alternative locations used by the rehashing strategy are chosen at random. Finally, we shall consider some strategies for rehashing that approximate random behavior.

The probability of a collision on our initial probe is $N/B$. Assuming a collision, our first rehash will try one of $B-1$ buckets, of which $N-1$ are filled, so the probability of at least two collisions is $\frac{N(N-1)}{B(B-1)}$. Similarly, the probability of at least $i$ collisions is

$$\frac{N(N-1)\cdots(N-i+1)}{B(B-1)\cdots(B-i+1)} \qquad (4.3)$$

If $B$ and $N$ are large, this probability approximates $(N/B)^i$. The average number of probes is one (for the successful insertion) plus the sum over all $i \ge 1$ of the probability of at least $i$ collisions, that is, approximately $1 + \sum_{i=1}^{\infty} (N/B)^i$, or $\frac{B}{B-N}$. It can be shown that the exact value of the summation when formula (4.3) is substituted for $(N/B)^i$ is $\frac{B+1}{B+1-N}$, so our approximation is a good one except when $N$ is very close to $B$.
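The closed form for the exact average can be checked numerically. The following sketch uses exact rational arithmetic; the function names `prob_at_least` and `expected_probes` and the table size of 100 are illustrative assumptions, not part of the text.

```python
from fractions import Fraction


def prob_at_least(i, N, B):
    """Formula (4.3): probability of at least i collisions
    when N of B buckets are filled and probes are chosen at random."""
    p = Fraction(1)
    for j in range(i):
        p *= Fraction(N - j, B - j)
    return p


def expected_probes(N, B):
    """One probe for the successful try plus the sum of (4.3) over i >= 1.

    Terms with i > N are zero, so the sum stops at i = N.
    """
    return 1 + sum(prob_at_least(i, N, B) for i in range(1, N + 1))


B = 100
for N in (10, 50, 90, 99):
    exact = expected_probes(N, B)
    # The exact sum agrees with the closed form (B+1)/(B+1-N).
    assert exact == Fraction(B + 1, B + 1 - N)
    # Print the exact value next to the approximation B/(B-N).
    print(N, float(exact), B / (B - N))
```

The two printed columns stay close for moderate N and drift apart only as N approaches B.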

Observe that $\frac{B+1}{B+1-N}$ grows very slowly as $N$ begins to grow from 0 to $B-1$, the largest value of $N$ for which another insertion is possible. For example, if $N$ is half $B$, then about two probes are needed for the next insertion, on the average. The average insertion cost per bucket to fill $M$ of the $B$ buckets is $\frac{1}{M}\sum_{N=0}^{M-1}\frac{B+1}{B+1-N}$, which is approximately $\frac{1}{M}\int_{0}^{M-1}\frac{B}{B-x}\,dx$, or $\frac{B}{M}\log_e\frac{B}{B-M+1}$. Thus, to fill the table completely ($M = B$) requires an average of $\log_e B$ probes per bucket, or $B \log_e B$ probes in total. However, to fill the table to 90% of capacity ($M = .9B$) requires only about $(10/9)\log_e 10$, or approximately 2.56, probes per bucket, which is about $B \log_e 10 \approx 2.30B$ probes in total.
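As a rough check of the logarithmic approximation, the per-bucket average can be computed directly from the sum and compared with the closed form. This is a sketch under the random-rehash assumption; the function names and the table size of 10,000 are chosen here for illustration.

```python
import math


def average_fill_cost(M, B):
    """Average probes per insertion when filling M of B buckets, using
    (B+1)/(B+1-N) as the cost of the insertion made with N buckets full."""
    return sum((B + 1) / (B + 1 - N) for N in range(M)) / M


def approx_fill_cost(M, B):
    """Logarithmic approximation (B/M) * ln(B / (B - M + 1))."""
    return (B / M) * math.log(B / (B - M + 1))


B = 10_000
for fraction in (0.5, 0.9, 1.0):
    M = round(fraction * B)
    print(fraction, average_fill_cost(M, B), approx_fill_cost(M, B))
```

The two columns agree closely at 50% and 90% of capacity; the gap widens when the table is filled completely, where the integral is a coarser stand-in for the sum.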

The average cost of the membership test for a nonexistent element is the same as the cost of inserting the next element, but the cost of the membership test for an element in the set is the average cost of all insertions made so far, which is substantially less if the table is fairly full. Deletions have the same average cost as membership testing. But unlike open hashing, deletions from a closed hash table do not help speed up subsequent insertions or membership tests. It should be emphasized that if we never fill closed hash tables to more than any fixed fraction less than one, the average cost of operations is a constant; the constant grows as the permitted fraction of capacity grows. Figure 4.15 graphs the cost of insertions, deletions and membership tests, as a function of the percentage of the table that is full at the time the operation is performed.
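The costs discussed above can be tabulated with a few lines of code. This is only a sketch under the random-rehash model, using (B+1)/(B+1-N) as the cost of an insertion with N buckets full; the names `miss_cost` and `hit_cost` and the table size of 1000 are assumptions made for illustration.

```python
def miss_cost(N, B):
    """Average probes to insert an element, or to search for one that is
    absent, when N of B buckets are full."""
    return (B + 1) / (B + 1 - N)


def hit_cost(N, B):
    """Average probes to find an element that is present: the mean cost
    of the N insertions made so far."""
    return sum(miss_cost(n, B) for n in range(N)) / N


B = 1000
print("% full   insert/absent   present")
for percent in (10, 50, 75, 90, 99):
    N = B * percent // 100
    print(f"{percent:6d}   {miss_cost(N, B):13.2f}   {hit_cost(N, B):7.2f}")
```

The second column climbs steeply as the table nears capacity, while the third grows much more slowly, which is the gap between the two kinds of membership test that the paragraph points out.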

Fig. 4.15. Average operation cost.

"Random" Strategies for Collision Resolution

We have observed that the linear rehashing strategy tends to group full buckets into large consecutive blocks. Perhaps we could get more "random" behavior if we probed at a constant interval greater than one. That is, let h_i(x) = (h(x)+ci) mod B for some c > 1. For example, if B = 8, c = 3, and h(x) = 4, we would probe buckets 4, 7, 2, 5, 0, 3, 6, and 1, in that order. Of course, if c and B have a common factor greater than one, this strategy doesn't even allow us to search all buckets; try B = 8 and c = 2, for example. But more significantly, even if c and B are relatively prime (have no common factors), we have the same "bunching up" problem as with linear hashing, although here it is sequences of full buckets separated by difference c that tend to occur. This phenomenon slows down operations as for linear hashing, since an attempted insertion into a full bucket will tend to travel down a chain of full buckets separated by distance c, and the length of this chain will increase by one.
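The probe order for a constant interval is easy to compute directly. The helper name `probe_sequence` below is hypothetical; the parameters are the ones from the example in the text.

```python
def probe_sequence(h, c, B):
    """Buckets visited by h_i(x) = (h(x) + c*i) mod B for i = 0, 1, ..., B-1."""
    return [(h + c * i) % B for i in range(B)]


# B = 8, c = 3, h(x) = 4: every bucket is reached exactly once.
print(probe_sequence(4, 3, 8))                 # [4, 7, 2, 5, 0, 3, 6, 1]

# B = 8, c = 2: c and B share the factor 2, so only half the buckets are reached.
print(sorted(set(probe_sequence(4, 2, 8))))    # [0, 2, 4, 6]
```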

In fact, any rehashing strategy where the target of a probe depends only on the target of the previous probe (as opposed to depending on the number of unsuccessful probes so far, the original bucket h(x), or the value of the key x itself) will exhibit the bunching property of linear hashing. Perhaps the simplest strategy in which the problem does not occur is to let h_i(x) = (h(x)+d_i) mod B where d₁, d₂, . . . , d_{B-1} is a "random" permutation of the integers 1, 2, . . . , B-1. Of course, the same sequence d₁, . . . , d_{B-1} is used for all insertions, deletions and membership tests; the "random" shuffle of the integers is decided upon once, when we design the rehash algorithm. The generation of "random" numbers is a complicated subject, but fortunately, many common methods do produce a sequence of "random" numbers that is actually a permutation of integers from 1 up to some limit. These random number generators, if reset to their initial values for each operation on the hash table, serve to generate the desired sequence d₁, . . . , d_{B-1}.
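A minimal sketch of this idea follows. It uses Python's `random.Random` with a fixed seed as a stand-in for "deciding the shuffle once"; that choice, the seed value, and the names `make_offsets` and `probe_order` are assumptions for illustration rather than the method the text goes on to describe.

```python
import random


def make_offsets(B, seed=0):
    """Fix one 'random' permutation d_1..d_{B-1} of 1..B-1.

    The fixed seed (an assumption here) makes every call return the same
    permutation, so all operations on the table share one sequence.
    """
    offsets = list(range(1, B))
    random.Random(seed).shuffle(offsets)
    return offsets


def probe_order(h, offsets, B):
    """Buckets tried for a key whose initial bucket is h:
    h itself, then (h + d_i) mod B for each d_i in turn."""
    return [h] + [(h + d) % B for d in offsets]


B = 8
offsets = make_offsets(B)
print(offsets)
print(probe_order(4, offsets, B))
```

Because the offsets are a permutation of 1 to B-1, each probe order visits every bucket exactly once, and where a probe goes next depends on how many probes have failed so far rather than only on the previous target.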

One effective approach uses "shift register sequences." Let B be a power of 2 and k a constant between 1 and B-1. Start with some number d₁ in the range 1 to B - 1, and generate successive numbers in the sequence by taking the previous value, doubling it, and if the result is B or more, subtracting B and taking the bitwise modulo 2 sum of the result and the selected constant k. The bitwise modulo 2 sum of x and y, written x ⊕ y, is computed by writing x and y in binary, with leading 0's if necessary so both are of the same length, and forming the number whose binary representation has a 1 in those positions where one, but not both, of x and y has a 1.

Example 4.7. 25 ⊕ 13 is computed by taking

         25 = 11001
         13 = 01101
              _____
    25 ⊕ 13 = 10100

Note that this "addition" can be thought of as ordinary binary addition with carries from place to place ignored.
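In most programming languages the bitwise modulo 2 sum is the exclusive-or operator. As a quick check of Example 4.7 in Python, where the operator is written `^`:

```python
x, y = 25, 13

# Show both operands and the result as 5-bit binary strings.
print(format(x, "05b"), format(y, "05b"), format(x ^ y, "05b"))   # 11001 01101 10100
print(x ^ y)                                                      # 20
```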

Not every value of k will produce a permutation of 1, 2, . . . , B-1; sometimes a number repeats before all are generated. However, for given B, there is a small but finite chance that any particular k will work, and we need only find one k for each B.

Example 4.8. Let B = 8. If we pick k = 3, we succeed in generating all of 1, 2, . . . , 7.

For example, if we start with d₁ = 5, then we compute d₂ by first doubling d₁ to get 10. Since 10 > 8, we subtract 8 to get 2, and then compute d₂ = 2 ⊕ 3 = 1.

It is instructive to see the 3-bit binary representations of d₁, d₂, . . . , d₇. These are shown in Fig. 4.16, along with the method of their calculation. Note that multiplication by 2 corresponds to a shift left in binary. Thus we have a hint of the origin of the term "shift register sequence."

Fig. 4.16. Calculating a shift register sequence.

The reader should check that we also generate a permutation of 1, 2, . . . , 7 if we choose k = 5, but we fail for other values of k.
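The check suggested above can be carried out mechanically. The following is a sketch of the doubling rule as described in this section, with hypothetical function names; the reduction is applied whenever the doubled value reaches B, which is what keeps every term in the range 1 to B-1.

```python
def shift_register_sequence(B, k, d1=1):
    """Generate d_1, d_2, ... by doubling the previous value and, when the
    result reaches B, subtracting B and XOR-ing with k.

    Stops when a value repeats or leaves the range 1..B-1.
    B is assumed to be a power of 2.
    """
    seq, d = [], d1
    while 1 <= d < B and d not in seq:
        seq.append(d)
        d *= 2                  # shift left by one bit
        if d >= B:
            d = (d - B) ^ k     # drop the leading bit, then add k modulo 2
    return seq


def is_full_permutation(B, k):
    """True if the sequence started at 1 visits each of 1..B-1 exactly once."""
    return len(shift_register_sequence(B, k)) == B - 1


print(shift_register_sequence(8, 3, d1=5))                      # [5, 1, 2, 4, 3, 6, 7]
print([k for k in range(1, 8) if is_full_permutation(8, k)])    # [3, 5]
```

The same `is_full_permutation` test could be used to search for a workable k for a larger power of 2, since only one such k is needed per table size.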
