Data Structures Using C++ 1

Chapter 9

Search Algorithms

Chapter Objectives

• Learn the various search algorithms

• Explore how to implement the sequential and binary search algorithms

• Discover how the sequential and binary search algorithms perform

• Become aware of the lower bound on comparison-based search algorithms

• Learn about hashing

Sequential Search

template<class elemType>
int arrayListType<elemType>::seqSearch(const elemType& item)
{
    int loc;
    bool found = false;

    for (loc = 0; loc < length; loc++)
        if (list[loc] == item)
        {
            found = true;
            break;
        }

    if (found)
        return loc;
    else
        return -1;
}//end seqSearch

What is the time complexity?

Search Algorithms

• Search item: target

• To determine the average number of comparisons in the successful case of the sequential search algorithm:

– Consider all possible cases

– Find the number of comparisons for each case

– Add the number of comparisons and divide by the number of cases

Search Algorithms

Suppose that there are n elements in the list. The following expression gives the average number of comparisons, assuming that each element is equally likely to be sought:

(1 + 2 + ... + n) / n

It is known that

1 + 2 + ... + n = n(n + 1) / 2

Therefore, the following expression gives the average number of comparisons made by the sequential search in the successful case:

(n + 1) / 2
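The averaging argument above can be checked by brute force. The sketch below is only an illustration: it assumes a list of n distinct int keys stored in a std::vector, and the names seqSearchCounted and averageSuccessfulComparisons are hypothetical, not part of arrayListType.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential search that also reports how many key comparisons it made.
// Returns the index of item, or -1 if item is not in the list.
int seqSearchCounted(const std::vector<int>& list, int item, int& comparisons) {
    comparisons = 0;
    for (std::size_t loc = 0; loc < list.size(); ++loc) {
        ++comparisons;                       // one comparison per element visited
        if (list[loc] == item)
            return static_cast<int>(loc);
    }
    return -1;
}

// Average number of comparisons over all n successful searches,
// each of the n keys being equally likely to be the target.
double averageSuccessfulComparisons(int n) {
    assert(n > 0);
    std::vector<int> list(n);
    for (int i = 0; i < n; ++i)
        list[i] = i;                         // distinct keys 0 .. n-1 (assumption)

    long total = 0;
    for (int target = 0; target < n; ++target) {
        int comparisons = 0;
        seqSearchCounted(list, target, comparisons);
        total += comparisons;                // target at index i costs i + 1
    }
    return static_cast<double>(total) / n;
}
```

For n = 9 the total is 1 + 2 + ... + 9 = 45, so the function returns 5.0, which agrees with (n + 1) / 2.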

Binary Search

(assumes list is sorted)

Binary Search: middle element

mid = (first + last) / 2

Binary Search

template<class elemType>
int orderedArrayListType<elemType>::binarySearch(const elemType& item)
{
    int first = 0;
    int last = length - 1;
    int mid;
    bool found = false;

    while (first <= last && !found)
    {
        mid = (first + last) / 2;

        if (list[mid] == item)
            found = true;
        else if (list[mid] > item)
            last = mid - 1;
        else
            first = mid + 1;
    }

    if (found)
        return mid;
    else
        return -1;
}//end binarySearch

Binary Search: Example

• Unsuccessful search

• Total number of comparisons is 6

Performance of Binary Search

• Unsuccessful search

– for a list of length n, a binary search makes approximately 2 log₂(n + 1) key comparisons

• Successful search

– for a list of length n, on average, a binary search makes 2 log₂n – 4 key comparisons

• Worst case upper bound: 2 + 2 log₂n
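The worst-case bound can be checked empirically. The sketch below assumes the same loop shape as binarySearch (an == test followed by a > test in each iteration) but works on a std::vector<int>; the names binarySearchCounted, worstCaseComparisons, and withinWorstCaseBound are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Binary search that counts key comparisons: one for ==, and a second
// one for > whenever the == test fails. Returns the index or -1.
int binarySearchCounted(const std::vector<int>& list, int item, int& comparisons) {
    int first = 0;
    int last = static_cast<int>(list.size()) - 1;
    comparisons = 0;
    while (first <= last) {
        // Same value as (first + last) / 2, written to avoid int overflow.
        int mid = first + (last - first) / 2;
        ++comparisons;
        if (list[mid] == item)
            return mid;
        ++comparisons;
        if (list[mid] > item)
            last = mid - 1;
        else
            first = mid + 1;
    }
    return -1;
}

// Largest comparison count over every successful and unsuccessful search
// in a sorted list of n keys. The keys are 0, 2, ..., 2n-2 (assumption),
// so odd targets and -1 exercise all the unsuccessful cases.
int worstCaseComparisons(int n) {
    assert(n > 0);
    std::vector<int> list(n);
    for (int i = 0; i < n; ++i)
        list[i] = 2 * i;

    int worst = 0;
    for (int target = -1; target <= 2 * n - 1; ++target) {
        int comparisons = 0;
        binarySearchCounted(list, target, comparisons);
        if (comparisons > worst)
            worst = comparisons;
    }
    return worst;
}

// True when the measured worst case respects the bound 2 + 2 log2(n).
bool withinWorstCaseBound(int n) {
    return worstCaseComparisons(n) <= 2.0 + 2.0 * std::log2(static_cast<double>(n));
}
```

For n = 7 the worst case is 6 comparisons (three iterations of an unsuccessful search), below the bound 2 + 2 log₂7, which is about 7.6.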

Search Algorithm Analysis Summary

Lower Bound on Comparison-Based Search

• Definition: A comparison-based search algorithm performs its search by repeatedly comparing the target element to the list elements.

• Theorem: Let L be a list of size n > 1. Suppose that the elements of L are sorted. If SRH(n) denotes the minimum number of comparisons needed, in the worst case, by using a comparison-based algorithm to recognize whether an element x is in L, then SRH(n) = log₂(n + 1).

– If list not sorted, worst case is n comparisons

• Corollary: The binary search algorithm is the optimal worst-case algorithm for solving search problems by the comparison method (when the list is sorted).

– For unsorted lists, sequential search is optimal

Hashing

• An alternative to comparison-based search

• Requires storing data in a special data structure, called a hash table

• Main objectives in choosing hash functions:

– Choose a hash function that is easy to compute

– Minimize the number of collisions

Commonly Used Hash Functions

• Mid-Square

– Hash function, h, computed by squaring the identifier

– Using appropriate number of bits from the middle of the square to obtain the bucket address

– Middle bits of a square usually depend on all the characters, so it is expected that different keys will yield different hash addresses with high probability, even if some of the characters are the same

Commonly Used Hash Functions

• Folding

– Key X is partitioned into parts such that all the parts, except possibly the last part, are of equal length

– Parts then added, in convenient way, to obtain hash address

• Division (Modular arithmetic)

– Key X is converted into an integer iX

– This integer divided by size of hash table to get remainder, giving address of X in HT
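A minimal sketch of folding and division follows. It assumes the key is available as a string of decimal digits (for folding) or as a non-negative integer (for division); the function names foldingHash and divisionHash and the parameters partLen and tableSize are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Folding: split a string of decimal digits into parts of partLen digits
// (the last part may be shorter), add the parts, then reduce the sum
// modulo the table size to obtain a hash address.
int foldingHash(const std::string& digits, std::size_t partLen, int tableSize) {
    assert(partLen > 0 && tableSize > 0);
    long sum = 0;
    for (std::size_t pos = 0; pos < digits.size(); pos += partLen)
        sum += std::stol(digits.substr(pos, partLen));   // substr clamps at the end
    return static_cast<int>(sum % tableSize);
}

// Division: the key is already an integer; the remainder after dividing
// by the table size is the address.
int divisionHash(long key, int tableSize) {
    assert(key >= 0 && tableSize > 0);
    return static_cast<int>(key % tableSize);
}
```

With parts of three digits, the key "12345678" folds to 123 + 456 + 78 = 657, and 657 % 101 gives address 51 in a table of size 101.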

Commonly Used Hash Functions

Suppose that each key is a string. The following C++ function uses the division method to compute the address of the key:

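int hashFunction(const char *key, int keyLength)
{
    int sum = 0;

    // Add the character codes of key[0] through key[keyLength - 1].
    for (int j = 0; j < keyLength; j++)
        sum = sum + static_cast<int>(key[j]);

    return (sum % HTSize);   // HTSize: size of the hash table
}//end hashFunction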

Collision Resolution

• Algorithms to handle collisions

• Two categories of collision resolution techniques

– Open addressing (closed hashing)

– Chaining (open hashing)

Collision Resolution: Open Addressing

Pseudocode implementing linear probing:

hIndex = hashFunction(insertKey);
found = false;

while (HT[hIndex] != emptyKey && !found)
    if (HT[hIndex].key == insertKey)
        found = true;
    else
        hIndex = (hIndex + 1) % HTSize;

if (found)
    cerr << "Duplicate items are not allowed." << endl;
else
    HT[hIndex] = newItem;

Linear Probing

• 9 will be next location if h(x) = 6,7,8, or 9

• Probability of 9 being the next location is 4/20; for 14 it is 5/20; but it is only 1/20 for 0 or 1

Clustering

Random Probing

• Uses a random number generator to find the next available slot

• ith slot in the probe sequence is: (h(X) + r_i) % HTSize, where r_i is the ith value in a random permutation of the numbers 1 to HTSize – 1

• All insertions and searches use the same sequence of random numbers

Quadratic Probing

• ith slot in the probe sequence is: (h(X) + i²) % HTSize (start i at 0)

• Reduces primary clustering of linear probing

• We do not know if it probes all the positions in the table

• When HTSize is prime, quadratic probing probes about half the table before repeating the probe sequence
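The "about half the table" claim can be observed directly by counting how many distinct slots the first HTSize quadratic probes reach. This is only a sketch; the function name distinctQuadraticSlots and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Count the distinct slots visited by the probes (home + i*i) % tableSize
// for i = 0, 1, ..., tableSize - 1.
int distinctQuadraticSlots(int home, int tableSize) {
    assert(tableSize > 0 && home >= 0 && home < tableSize);
    std::vector<bool> visited(tableSize, false);
    int distinct = 0;
    for (long i = 0; i < tableSize; ++i) {
        int slot = static_cast<int>((home + i * i) % tableSize);
        if (!visited[slot]) {
            visited[slot] = true;
            ++distinct;
        }
    }
    return distinct;
}
```

For the prime size 101 the sequence reaches 51 distinct slots, roughly half the table. For the non-prime size 16 it reaches only 4, which illustrates why the table size matters for quadratic probing.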

Deletion: Open Addressing

• When deleting, need to remove the item from its spot, but cannot reset it to empty (Why?)

Deletion: Open Addressing

• IndexStatusList[i] set to –1 to mark item i as deleted
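The reason a deleted slot cannot simply be reset to empty is that a search stops at the first empty slot, so any key stored further along the same probe sequence would become unreachable. The sketch below illustrates the status-marker idea with linear probing on int keys. Only the value -1 for "deleted" comes from the slide; the codes 0 for empty and 1 for occupied, the class name ProbingTable, and the hash key % size are assumptions.

```cpp
#include <cassert>
#include <vector>

class ProbingTable {
public:
    explicit ProbingTable(int size) : keys(size, 0), status(size, 0) {
        assert(size > 0);
    }

    // Returns the slot holding key, or -1 if key is absent.
    int search(int key) const {
        const int size = static_cast<int>(keys.size());
        for (int i = 0; i < size; ++i) {
            const int slot = (home(key) + i) % size;
            if (status[slot] == 0)
                return -1;                   // truly empty: key cannot lie beyond
            if (status[slot] == 1 && keys[slot] == key)
                return slot;
            // status -1 (deleted): keep probing past the marker
        }
        return -1;
    }

    // Inserts key into the first empty or deleted slot of its probe
    // sequence. Returns false for a duplicate or a full table.
    bool insert(int key) {
        if (search(key) != -1)
            return false;
        const int size = static_cast<int>(keys.size());
        for (int i = 0; i < size; ++i) {
            const int slot = (home(key) + i) % size;
            if (status[slot] != 1) {
                keys[slot] = key;
                status[slot] = 1;
                return true;
            }
        }
        return false;
    }

    // Marks the slot as deleted instead of resetting it to empty.
    bool remove(int key) {
        const int slot = search(key);
        if (slot == -1)
            return false;
        status[slot] = -1;
        return true;
    }

private:
    int home(int key) const {
        assert(key >= 0);                    // sketch assumes non-negative keys
        return key % static_cast<int>(keys.size());
    }

    std::vector<int> keys;
    std::vector<int> status;                 // 0 empty, 1 occupied, -1 deleted
};
```

In a table of size 7, the keys 10, 17, and 24 all hash to slot 3 and end up in slots 3, 4, and 5. After removing 17, a search for 24 still succeeds because slot 4 is marked deleted rather than empty.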

Collision Resolution: Chaining (Open Hashing)

• No probing needed; instead put a linked list at each hash position
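A minimal sketch of chaining, assuming non-negative int keys, the division method as the hash function, and std::list for the chains. The class name ChainedTable and its member names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class ChainedTable {
public:
    explicit ChainedTable(int size) : buckets(size) {
        assert(size > 0);
    }

    // Adds key to the chain at its hash position; rejects duplicates.
    bool insert(int key) {
        std::list<int>& chain = buckets[index(key)];
        if (std::find(chain.begin(), chain.end(), key) != chain.end())
            return false;
        chain.push_front(key);
        return true;
    }

    // Only the one chain at the hash position has to be scanned.
    bool contains(int key) const {
        const std::list<int>& chain = buckets[index(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Deletion simply unlinks the node; no "deleted" marker is needed.
    bool remove(int key) {
        std::list<int>& chain = buckets[index(key)];
        std::list<int>::iterator it = std::find(chain.begin(), chain.end(), key);
        if (it == chain.end())
            return false;
        chain.erase(it);
        return true;
    }

private:
    std::size_t index(int key) const {
        assert(key >= 0);                    // sketch assumes non-negative keys
        return static_cast<std::size_t>(key) % buckets.size();
    }

    std::vector<std::list<int> > buckets;    // one linked list per hash position
};
```

Because colliding keys share a list, the table can hold more items than it has positions, and removing an item never disturbs the search for another.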

Hashing Analysis

Let α = (number of items stored in the table) / HTSize

Then α is called the load factor

Average Number of Comparisons

• Linear probing:

• Successful search:

• Unsuccessful search:

• Quadratic probing:

• Successful search:

• Unsuccessful search:

Chaining:

Average Number of Comparisons

1. Successful search

2. Unsuccessful search

Searching Lists

There are many instances when one is interested in [storing] and searching a list:
• A phone company wants to provide caller ID: given a phone number, a name is returned.
• [Somebody] who plays the Lotto thinks they can [improve] their chance to win if they keep track of the winning number from the last 3 years.
• [Both of] these have simple solutions using arrays, [linked] lists, binary trees, etc.
• [However], none of these solutions is adequate, as we will see.

Problematic List Searching

Here are some solutions to the problems:
• For caller ID:
  – Use an array indexed by phone number.
    ∗ Requires one operation to return name.
    ∗ An array of size 1,000,000,000 is needed.
  – Use a linked list.
    ∗ This requires O(n) time and space, where n is the number of actual phone numbers.
    ∗ This is the best we can do space-wise, but the time required is horrendous.
  – A balanced binary tree: O(n) space and O(log n) time. Better, but not good enough.
• For the Lotto we could use the same solutions.
  – The number of possible lotto numbers is 15,625,000 for a 6/50 lotto.
  – The number of entries stored is on the order of 1000, assuming a daily lotto.

The Hash Table Solution

• A hash table is similar to an array:
  – Has fixed size m.
  – An item with key k goes into index h(k), where h is a function from the keyspace to {0, 1, ..., m−1}.
• The function h is called a hash function.
• We call h(k) the hash value of k.
• A hash table allows us to store values from a large set in a small array in such a way that searching is fast.
• The space required to store n numbers in a hash table of size m is O(m + n). Notice that this does not depend on the size of the keyspace.
• The average time required to insert, delete, and search for an element in a hash table is O(1).
• Sounds perfect. Then what's the problem? Let's look at an example.

Hash Table Example

• [I have] 5 friends whose phone numbers I want to [store in] a hash table of size 8. Their names and [numbers] are:
    Susy Olson        555-1212
    Sarah Skillet     555-4321
    Ryan Hamster      545-3241
    Justin Case       555-6798
    Chris Lindmeyer   535-7869
• [I use] a hash function h(k) = k mod 8, where k [is the] phone number viewed as a 7-digit decimal [number].
• [Notice] that 5554321 mod 8 = 5453241 mod 8 = 1.
• [But I can't] put Sarah Skillet and Ryan Hamster [both in] the position 1.
• [Can w]e fix this problem?
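The hash values in this example can be recomputed with a few lines of code. The sketch below assumes phone numbers are given as strings such as "555-4321"; the function name phoneHash is hypothetical.

```cpp
#include <cassert>
#include <string>

// Drop every non-digit from a phone number, read the remaining digits
// as one decimal number k, and return h(k) = k mod m.
int phoneHash(const std::string& phone, int m) {
    assert(m > 0);
    long k = 0;
    for (char c : phone)
        if (c >= '0' && c <= '9')
            k = k * 10 + (c - '0');
    return static_cast<int>(k % m);
}
```

With m = 8 the five numbers hash to 4, 1, 1, 6, and 5 in the order listed, so Sarah Skillet and Ryan Hamster are the only pair that collides.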

Hash Table Problems

• The problem with hash tables is that two keys can have the same hash value. This is called a collision.
• There are several ways to deal with this problem:
  – Pick the hash function h to minimize the number of collisions.
  – Implement the hash table in a way that allows keys with the same hash value to all be stored.
• The second method is almost always needed, even for very good hash functions. Why?
• We will talk about 2 collision resolution techniques:
  – Chaining
  – Open Addressing
• But first, a bit more on hash functions.

Hash Functions

• Most hash functions assume the keys come from the set {0, 1, ...}.
• If the keys are not natural numbers, some method must be used to convert them.
  – Phone numbers and ID numbers can be converted by removing the hyphens.
  – Characters can be converted using ASCII.
  – Strings can be converted by converting each character using ASCII, and then interpreting the string of natural numbers as if it were stored base 128.
• We need to choose a good hash function.
• A hash function is good if
  – it can be computed quickly, and
  – the keys are distributed uniformly throughout the table.

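Good Hash Functions

• The [division] method: h(k) = k mod m
  – m must be chosen carefully.
  – [Powers] of 2 and 10 can be bad. Why?
  – [Primes] not too close to powers of 2 [are a good] choice.
  – [We] pick m by choosing an appropriate [prime] that is close to the table size we [want].
• The [multiplication] method: Let A be a constant [with 0 < A <] 1. Then we use h(k) = ⌊m (kA mod 1)⌋, [where "kA] mod 1" means the fractional part of [kA].
  – [The choice] of m is not as critical here.
  – [We can] choose m to make the implementation [...] fast.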
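The formula h(k) = m(kA mod 1) can be sketched in a few lines. The code below assumes double-precision arithmetic is accurate enough for the keys in use and takes the floor of the result so that it is a valid index. The default value of A, (√5 − 1)/2, is a commonly suggested constant and is an assumption here, as is the function name multiplicationHash.

```cpp
#include <cassert>
#include <cmath>

// h(k) = floor(m * (k*A mod 1)), where "k*A mod 1" is the fractional
// part of k*A. A must lie strictly between 0 and 1.
int multiplicationHash(unsigned long k, int m, double A = 0.6180339887498949) {
    assert(m > 0 && A > 0.0 && A < 1.0);
    const double product = static_cast<double>(k) * A;
    const double fractional = product - std::floor(product);   // in [0, 1)
    return static_cast<int>(std::floor(m * fractional));
}
```

As a hand-checkable case, with A = 0.5 and m = 8 the key 3 gives kA = 1.5, fractional part 0.5, and hash value 4. Since the fractional part is always below 1, the result stays in the range 0 to m − 1 for any m, which is why the choice of m is less critical than in the division method.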

Universal Hashing

• Let x and y be distinct keys.
• Let H be a set of hash functions.
• H is called universal if the number of functions h ∈ H for which h(x) = h(y) is precisely |H|/m.
• In other words, if we pick a random h ∈ H, the probability that x and y collide under h is 1/m.
• Universal hashing can be useful in many situations.
  – You are asked to come up with a hashing technique for your boss.
  – Your boss tells you that after you are done, he will pick some keys to hash.
  – If you get too many collisions, he will fire you.
  – If you use universal hashing, the only way he can succeed is by getting lucky. Why?

Collision Resolution: Chaining

• With chaining, we set up an array of links, indexed by the hash values, to lists of items with the same hash value.

[Figure: an array of slots 0 to 4, each linked to a list of the keys that hash to it]

• Let n be the number of keys, and m the size of the hash table. Define the load factor α = n/m.
• Successful and unsuccessful searching both take time Θ(1 + α) on average, assuming simple uniform hashing.
• By simple uniform hashing we mean that a key has equal probability to hash into any of the m slots, independent of the other elements of the table.

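Collision Resolution: Open Addressing

• [With Open] Addressing, all elements are stored in [the hash table]. [This means] several things:
  – [Less memory] is used than with chaining since [we don't] have to store pointers.
  – [The hash] table has an absolute size limit of m. [Thus,] planning ahead is important when using [open addressing].
  – [We must] have a way to store multiple elements [with the] same hash value.
• [Instead of a] hash function, we need to use a probe [sequence]. That is, a sequence of hash values.
• [We go through] the sequence one by one until we [find an empty] position.
• [For searching,] we do the same thing, skipping [values that] do not match our key.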

Probe Sequences

• We define our hash functions with an extra parameter, the probe number i. Thus our hash functions look like h(k, i).
• We compute h(k, 0), h(k, 1), ..., h(k, i) until h(k, i) is empty.
• The hash function h must be such that the probe sequence h(k, 0), ..., h(k, m−1) is a permutation of 0, 1, ..., m−1. Why?
• Insertion and searching are fairly quick, assuming a good implementation.
• Deletion can be a problem because the probe sequence can be complex.
• We will discuss 2 ways of defining probe sequences:
  – Linear Probing
  – Double Hashing

Linear Probing

• Let g be a hash function.
• When we use linear probing, the probe sequence is computed by h(k, i) = (g(k) + i) mod m.
• When we insert an element, we start with the hash value, and proceed element by element until we find an empty slot.
• For searching, we start with the hash value and proceed element by element until we find the key we are looking for.
• Example: Let g(k) = k mod 13. We will insert the following keys into the hash table: 18, 41, 22, 44, 59, 32, 31, 73

[Figure: empty hash table with slots 0 through 12]

• Problem: The values in the table tend to cluster.
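The example can be worked through in code. The sketch below assumes non-negative int keys and uses -1 as the marker for an empty slot; the function name insertWithLinearProbing is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Insert each key using h(k, i) = (g(k) + i) mod m with g(k) = k mod m.
// Empty slots hold -1, which assumes every key is non-negative.
std::vector<int> insertWithLinearProbing(const std::vector<int>& keys, int m) {
    assert(m > 0 && keys.size() <= static_cast<std::size_t>(m));
    std::vector<int> table(m, -1);
    for (int k : keys) {
        for (int i = 0; i < m; ++i) {
            const int slot = (k % m + i) % m;
            if (table[slot] == -1) {         // first empty slot in the probe sequence
                table[slot] = k;
                break;
            }
        }
    }
    return table;
}
```

For the keys 18, 41, 22, 44, 59, 32, 31, 73 and m = 13, the table ends up with 41 in slot 2 and 18, 44, 59, 32, 22, 31, 73 in slots 5 through 11. Seven of the eight keys sit in one contiguous run, which is the clustering problem noted above.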

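Double Hashing

• [With double] hashing we use two hash functions.
• [Let h1] and h2 be hash functions.
• [We define] our probe sequence by h(k, i) = (h1(k) + i × h2(k)) mod m.
• [Often] pick m = 2^n for some n, and ensure that [h2(k)] is always odd.
• [Double] hashing tends to distribute keys more [uniformly] than linear probing.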

Double Hashing Example

• Let h1 = k mod 13
• Let h2 = 1 + (k mod 8), and
• Let the hash table have size 13.
• Then our probe sequence is defined by
    h(k, i) = (h1(k) + i × h2(k)) mod 13
            = (k mod 13 + i(1 + (k mod 8))) mod 13
• Insert the following keys into the table: 18, 41, 22, 44, 59, 32, 31, 73

[Figure: empty hash table with slots 0 through 12]
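The insertions in this example can be traced with a short program. The sketch below assumes non-negative int keys and uses -1 as the marker for an empty slot; the function name insertWithDoubleHashing is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Insert keys using h(k, i) = (h1(k) + i * h2(k)) mod m with
// h1(k) = k mod m and h2(k) = 1 + (k mod 8), as in the example.
// Empty slots hold -1, which assumes every key is non-negative.
std::vector<int> insertWithDoubleHashing(const std::vector<int>& keys, int m) {
    assert(m > 0 && keys.size() <= static_cast<std::size_t>(m));
    std::vector<int> table(m, -1);
    for (int k : keys) {
        const int h1 = k % m;
        const int h2 = 1 + (k % 8);          // step size, never zero
        for (int i = 0; i < m; ++i) {
            const int slot = (h1 + i * h2) % m;
            if (table[slot] == -1) {
                table[slot] = k;
                break;
            }
        }
    }
    return table;
}
```

For m = 13 the keys land as follows: 31 in slot 0, 41 in slot 2, 18 in slot 5, 32 in slot 6, 59 in slot 7, 73 in slot 8, 22 in slot 9, and 44 in slot 10. Only 44 and 31 need a second probe, compared with four displaced keys under linear probing with the same keys and table size.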

Open Addressing: Performance

• Again, we set α = n/m. This is the average number of keys per array index.
• Note that α ≤ 1 for open addressing.
• We assume a good probe sequence has been used. That is, for any key, each permutation of 0, 1, ..., m−1 is equally likely as a probe sequence.
• The average number of probes for insertion or unsuccessful search is at most 1/(1 − α).
• The average number of probes for a successful search is at most (1/α) ln(1/(1 − α)) + 1/α.
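To get a feel for these bounds, they can be evaluated at a few load factors. The sketch below simply codes the two expressions stated above; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Upper bound on the average number of probes for an insertion or an
// unsuccessful search at load factor alpha: 1 / (1 - alpha).
double openAddressingUnsuccessful(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return 1.0 / (1.0 - alpha);
}

// Upper bound on the average number of probes for a successful search:
// (1/alpha) * ln(1 / (1 - alpha)) + 1/alpha.
double openAddressingSuccessful(double alpha) {
    assert(alpha > 0.0 && alpha < 1.0);
    return (1.0 / alpha) * std::log(1.0 / (1.0 - alpha)) + 1.0 / alpha;
}
```

At α = 0.5 the bounds are 2 probes for an unsuccessful search and about 3.39 for a successful one. At α = 0.9 the unsuccessful bound has already grown to 10, which is why open-addressing tables are usually kept well below full.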

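[Figure: Hashing Performance: Chaining Versus Open Addressing. Number of probes (vertical axis, 0 to 20) against load factor x (horizontal axis, 0 to 1) for four curves: chaining insert (Ch: I), 1; chaining search (Ch: S), 1 + x; open addressing insert/unsuccessful search (OA: I/US), 1/(1 − x); open addressing successful search (OA: SS), (1/x) log(1/(1 − x)) + 1/x.]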