> Instead use a second, different hash function (h2(x)) to determine the increment

(1)

Double Hashing

• Double Hashing

– Idea: When a collision occurs, increment the index

(mod tablesize), just as in linear probing. However,

now do not automatically choose 1 as the increment

value

> Instead use a second, different hash function (h2(x)) to determine the increment

– This way keys that hash to the same location

will likely not have the same increment

> h1(x1) == h1(x2) with x1 != x2 is bad luck (assuming a good hash function)

> However, ALSO having h2(x1) == h2(x2) is REALLY bad luck, and should occur even less frequently

> It also allows for a collided key to move (mostly – depending on h2(x)) anywhere in the table

– See example on next slide

(2)

Double Hashing Example

Index Value Probes 0

1 2 3 4 5 6 7 8 9 10

14

17

25

37

34

16 h(x) = x mod 11

h (x) = (x mod 7) + 1

26

1

2

1

2 h(x) = 3

h(x) = 4

Compare to Slide 18 of Lecture 5

h2(x) = 5

h2(x) = 6

h(x) = 4

h(x) = 6

h(x) = 3

h(x) = 1

h(x) = 5

(3)

Double Hashing

• Note that we still get collisions with DH

– And even multiple collisions in one operation

– In this case we iterate just as we do with LP, using

the DH increment multiple times

• However, because h2(x) varies for different keys, it

allows us to spread the data throughout the table,

even after an initial collision

• But we must be careful to ensure that double

hashing always "works"

– Make sure increment is > 0

> Note the +1 in our h2(x): h2(x) = (x mod 7) + 1

> Our mod operator can result in 0, which is fine for an absolute address, but not for an increment!

(4)

Double Hashing

– Make sure no index is tried twice before

all are tried once

> Why? Think about this?

> Consider table to right and assume:

> h(Z) = 3 and h2(Z) = 2

> What would happen when we search the table?

> How can we fix this?

> Make M a prime number

– Note that these were not issues for

linear probing, since the increment is

clearly > 0 and if our increment is 1 we

will clearly try all indices once before

trying any twice

Index Value 0

1 V

2

3 W

4

5 X

6

7 Y

(5)

Collision Resolution

• As  increases, double hashing shows a definite

improvement over linear probing

– Discuss

• However, as   1 (or as N  M), both schemes

degrade to Theta(N) performance

– Since there are only M locations in the table, as it fills

there become fewer empty locations remaining

– Multiple collisions will occur even with double hashing

– This is especially true for inserts and unsuccessful finds

> Both of these continue until an empty location is found, and few of these exist

> Thus it could take close to M probes before the collision is resolved

> Since the table is almost full Theta(M) = Theta(N)

(6)

Open Addressing Issues

 Open Addressing Issues

• We have just seen that performance degrades as N

approaches M

– Typically for open addressing we want to keep the table

partially empty

> For linear probing,  = 1/2 is a good rule of thumb

> For double hashing, we can go a bit higher (3/4 or more)

– How can we do this?

> Monitor the logical size (number of entries) vs. physical size (array length) to calculate 

> Resize the array and rehash all of the values when  gets past the threshold

> Rehashing all of the data seems like a LOT of work!

> Is this better than leaving it as is?

> We will discuss

(7)

Open Addressing Issues

• What about delete?

– Why is this a problem?

– Consider the LP table on the right

and assume H(Z) == 2 but it was

placed in index 4 due to a

collision

– Search for Z would try 2, 3, 4,

finding Z at location 4

– Now delete(Y) and search for Z

again

> Search would stop at index 3 with not found even though Z is present

– Deleting Y broke the chain

• How can we fix this?

Index Value 0

1 W

2 X

3 Y

4 Z

5 6 7

(8)

Index Value 0

1 W

2 X

3 4 5 6 7

Open Addressing Issues

– One solution (see p. 471 of text)

> Rehash all keys from deleted key to end of cluster

> Note that in this case Z still hashes to 2 and will move to position 3 and once again be within the chain

> Will this be a lot of work?

> Discuss

– Will not work with double hashing

though – why?

– What can we do with double

hashing?

> Discuss

Y

Z

(9)

Open Addressing Issues

– Can we use hashing without delete?

> Yes, in some cases (ex: compiler using language keywords)

> We build a hash table, use it for searches, and then throw it away entirely

> We never delete individual items

(10)

Closed Addressing

• Closed Addressing

 Recall that in this scheme, each location in the

hash table represents a collection of data

• If we have a collision we resolve it within the

collection, without changing hash addresses

 Most common form is separate chaining

• Use a simple linked-list at each location in the table

– Look at example

> Using the same data that we previously used for linear probing and separate chaining

– Discuss placement of nodes in chain

(11)

Separate Chaining

Index Value 0

1 2 3 4 5 6 7 8 9 10

14

17

25

37

34

16

26 h(x) = 3

h(x) = 4

h(x) = 6

h(x) = 3

h(x) = 1

h(x) = 5

h(x) = x mod 11

(12)

Separate Chaining

 Performance of separate chaining?

• Performance is dependent upon chain length

• Clearly a not found search must traverse entire chain

– Chain length is determined by the load factor, 

> Ave chain length = (total # of nodes)/(M)

> But (total # of nodes) == N so

> Ave chain length == N/M = 

– As long as  is a small constant, performance is still

Theta(1)

> Ex: N = 150, M = 100   = 1.5

> This is still clearly Theta(1)

> Note also that N can now be greater than M

> More graceful degradation than open addressing schemes

(13)

Separate Chaining

– However, if N >> M, then it can still degrade to

Theta(N) performance

> Ex: N = 1000, M = 10   = 100

> Thus we may still need to resize the array when  gets too big

– A poor hash function can also degrade this into

Theta(N)

> Think about what will happen in this case

> Discuss

• Can we develop a closed addressing scheme that

can mitigate the damage caused by a poor hash

function?

– Think about this!

(14)

Collision Resolution

 What if we used "better" collections at each index?

• Sorted array?

– Space overhead if we make it large and copying overhead

if we need to resize it

– Inserts require shifting

• BST?

– Could work

> Now a poor hash function would lead to a large tree at one index – still Theta(logN) as long as tree is relatively