Data Structures Using C++ 1
Chapter 9
Search Algorithms
Data Structures Using C++ 2
Chapter Objectives
• Learn the various search algorithms
• Explore how to implement the sequential and binary search algorithms
• Discover how the sequential and binary search algorithms perform
• Become aware of the lower bound on comparison-based search algorithms
• Learn about hashing
Data Structures Using C++ 3
Sequential Search
template<class elemType>
int arrayListType<elemType>::seqSearch(const elemType& item) {
int loc;
bool found = false;
for(loc = 0; loc < length; loc++) if(list[loc] == item) {
found = true;
break;
} if(found)
return loc;
else return -1;
}//end seqSearch
What is the time complexity?
Data Structures Using C++ 4
Search Algorithms
• Search item: target
• To determine the average number of comparisons in the successful case of the sequential search algorithm:
– Consider all possible cases
– Find the number of comparisons for each case – Add the number of comparisons and divide by
the number of cases
Search Algorithms
Suppose that there are n elements in the list. The following expression gives the average number of comparisons, assuming that each element is equally likely to be sought:
It is known that
Therefore, the following expression gives the average number of comparisons made by the sequential search in the successful case:
Binary Search
(assumes list is sorted)
Data Structures Using C++ 7
Binary Search: middle element
first + last mid = 2
Data Structures Using C++ 8
Binary Search
template<class elemType>
int orderedArrayListType<elemType>::binarySearch (const elemType& item) {
int first = 0;
int last = length - 1;
int mid;
bool found = false;
while(first <= last && !found) {
mid = (first + last) / 2;
if(list[mid] == item) found = true;
else
if(list[mid] > item) last = mid - 1;
else
first = mid + 1;
} if(found)
return mid;
else return –1;
}//end binarySearch
Data Structures Using C++ 9
Binary Search: Example
Data Structures Using C++ 10
Binary Search: Example
• Unsuccessful search
• Total number of comparisons is 6
Performance of Binary Search Performance of Binary Search
Data Structures Using C++ 13
Performance of Binary Search
• Unsuccessful search
– for a list of length n, a binary search makes approximately 2 log2 (n + 1) key comparisons
• Successful search
– for a list of length n, on average, a binary search makes 2 log2 n – 4 key comparisons
• Worst case upper bound: 2 + 2 log2n
Data Structures Using C++ 14
Search Algorithm Analysis Summary
Data Structures Using C++ 15
Lower Bound on Comparison- Based Search
• Definition: A comparison-based search algorithm performs its search by repeatedly comparing the target element to the list elements.
• Theorem: Let L be a list of size n > 1. Suppose that the elements of L are sorted. If SRH(n) denotes the minimum number of comparisons needed, in the worst case, by using a comparison-based algorithm to recognize whether an element x is in L, then SRH(n) = log2 (n + 1).
– If list not sorted, worst case is n comparisons
• Corollary: The binary search algorithm is the optimal worst-case algorithm for solving search problems by the comparison method (when the list is sorted).
– For unsorted lists, sequential search is optimal
Data Structures Using C++ 16
Hashing
• An alternative to comparison-based search
• Requires storing data in a special data structure, called a hash table
• Main objectives to choosing hash functions:
– Choose a hash function that is easy to compute – Minimize the number of collisions
Commonly Used Hash Functions
• Mid-Square
– Hash function, h, computed by squaring the identifier – Using appropriate number of bits from the middle of
the square to obtain the bucket address – Middle bits of a square usually depend on all the
characters, it is expected that different keys will yield different hash addresses with high probability, even if some of the characters are the same
Commonly Used Hash Functions
• Folding
– Key X is partitioned into parts such that all the parts, except possibly the last parts, are of equal length – Parts then added, in convenient way, to obtain hash
address
• Division (Modular arithmetic) – Key X is converted into an integer iX – This integer divided by size of hash table to get
remainder, giving address of X in HT
Data Structures Using C++ 19
Commonly Used Hash Functions
Suppose that each key is a string. The following C++ function uses the division method to compute the address of the key:
int hashFunction(char *key, int keyLength) {
int sum = 0;
for(int j = 0; j <= keyLength; j++) sum = sum + static_cast<int>(key[j]);
return (sum % HTSize);
}//end hashFunction
Data Structures Using C++ 20
Collision Resolution
• Algorithms to handle collisions
• Two categories of collision resolution techniques
– Open addressing (closed hashing) – Chaining (open hashing)
Data Structures Using C++ 21
Collision Resolution:
Open Addressing
Pseudocode implementing linear probing:
hIndex = hashFunction(insertKey);
found = false;
while(HT[hIndex] != emptyKey && !found) if(HT[hIndex].key == key)
found = true;
else
hIndex = (hIndex + 1) % HTSize;
if(found)
cerr<<”Duplicate items are not allowed.”<<endl;
else
HT[hIndex] = newItem;
Data Structures Using C++ 22
Linear Probing
• 9 will be next location if h(x) = 6,7,8, or 9
Probability of 9 being next = 4/20, for 14, it’s 5/20, but only 1/20 for 0 or 1
Clustering
Random Probing
• Uses a random number generator to find the next available slot
• ith slot in the probe sequence is: (h(X) + r
i)
% HTSize where r
iis the ith value in a random permutation of the numbers 1 to HTSize – 1
• All insertions and searches use the same sequence of random numbers
Quadratic Probing
• ith slot in the probe sequence is: (h(X) + i2) % HTSize (start i at 0)
• Reduces primary clustering of linear probing
• We do not know if it probes all the positions in the table
• When HTSize is prime, quadratic probing probes about half the table before repeating the probe sequence
Data Structures Using C++ 25
Deletion: Open Addressing
• When deleting, need to remove the item from its spot, but cannot reset it to empty (Why?)
Data Structures Using C++ 26
Deletion: Open Addressing
• IndexStatusList[i] set to –1 to mark item i as deleted
Data Structures Using C++ 27
Collision Resolution:
Chaining (Open Hashing)
• No probing needed; instead put linked list at each hash
position Data Structures Using C++ 28
Hashing Analysis
Let
Then a is called the load factor
Average Number of Comparisons
• Linear probing:
• Successful search:
• Unsuccessful search:
• Quadratic probing:
• Successful search:
• Unsuccessful search:
Chaining:
Average Number of Comparisons
1. Successful search
2. Unsuccessful search
1
Sear ching L ists
aremanyinstanceswhenoneisinterested andsearchingalist: phonecompanywantstoprovidecallerID: venaphonenumberanameisreturned. whoplaystheLottothinkstheycan theirchancetowiniftheykeeptrack thewinningnumberfromthelast3years. thesehavesimplesolutionsusingarrays, lists,binarytrees,etc. ,noneofthesesolutionsisadequate, willsee.Hashing2
Pr oblematic L ist Sear ching
Herearesomesolutionstotheproblems: •ForcallerID: –Useanarrayindexedbyphonenumber. ∗Requiresoneoperationtoreturnname. ∗Anarrayofsize1,000,000,000isneeded. –Usealinkedlist. ∗ThisrequiresO(n)timeandspace,wheren isthenumberofactualphonenumbers. ∗Thisisthebestwecandospace-wise,but thetimerequiredishorrendous. –Abalancedbinarytree:O(n)spaceand O(logn)time.Better,butnotgoodenough. •FortheLottowecouldusethesamesolutions. –Thenumberofpossiblelottonumbersis 15,625,000fora6/50lotto. –Thenumberofentriesstoredisontheorderof 1000,assumingadailylotto.s
Hashing3
The H ash T able Solution
•AHashtableissimilartoanarray: –Hasfixedsizem. –Anitemwithkeykgoesintoindexh(k), wherehisafunctionfromthekeyspaceto {0,1,...,m−1} •Thefunctionhiscalledahashfunction. •Wecallh(k)thehashvalueofk. •Ahashtableallowsustostorevaluesfroma largesetinasmallarrayinsuchawaythat searchingisfast. •Thespacerequiredtostorennumbersinahash tableofsizemisO(m+n).Noticethatthisdoes notdependonthesizeofthekeyspace. •Theaveragetimerequiredtoinsert,delete,and searchforanelementinahashtableisO(1). •Soundsperfect.Thenwhat’stheproblem?Let’s lookatanexample. 4Hash T a ble E xample
5friendswhosephonenumbersIwantto ahashtableofsize8.Theirnamesand are: SusyOlson555-1212 SarahSkillet555-4321 RyanHamster545-3241 JustinCase555-6798 ChrisLindmeyer535-7869 ahashfunctionh(k)=kmod8,wherek phonenumberviewedasa7-digitdecimal . that mod8=5453241mod8=1. tputSarahSkilletandRyanHamster theposition1. efixthisproblem?Hashing5
Hash T a ble P ro blems
•Theproblemwithhashtablesisthattwokeyscan havethesamehashvalue.Thisiscalledcollision. •Thereareseveralwaystodealwiththisproblem: –Pickthehashfunctionhtominimizethe numberofcollisions. –Implementthehashtableinawaythatallows keyswiththesamehashvaluetoallbestored. •Thesecondmethodisalmostalwaysneeded,even forverygoodhashfunctions.Why? •Wewilltalkabout2collisionresolution techniques: –Chaining –OpenAddressing •Butfirst,abitmoreonhashfunctionsHashing6
Hash Functions
•Mosthashfunctionsassumethekeyscomefrom theset{0,1,...}. •Ifthekeysarenotnaturalnumbers,somemethod mustbeusedtoconvertthem. –PhonenumberandIDnumberscanbe convertedbyremovingthehyphens. –CharacterscanbeconvertedusingASCII. –Stringscanbeconvertedbyconvertingeach characterusingASCII,andtheninterpreting thestringofnaturalnumbersasifitwhere storedbase128. •Weneedtochooseagoodhashfunction. •Ahashfunctionisgoodif –itcanbecomputedquickly,and –thekeysaredistributeduniformlythroughout thetable.7
Good Hash Functions
method: h(k)=kmodm mmustbechosencarefully. of2and10canbebad.Why? nottooclosetopowersof2 choice. pickmbychoosinganappropriate thatisclosetothetablesizewe method:LetAbeaconstant 1.Thenweuse h(k)=m(kAmod1), mod1”meansthefractionalpartof ofmisnotascriticalhere. choosemtomaketheimplementation fast.Hashing8
Uni v ersal Hashing
•Letxandybedistinctkeys. •LetHbeasetofhashfunctions. •Hiscalleduniversalifthenumberoffunctions h∈Hforwhichh(x)=h(y)isprecisely|H|/m. •Inotherwords,ifwepickarandomh∈H,the probabilitythatxandycollideunderhis1/m. •Universalhashingcanbeusefulinmany situations. –Youareaskedtocomeupwithahashing techniqueforyourboss. –Yourbosstellsyouthatafteryouaredone,he willpicksomekeystohash. –Ifyougettoomanycollisions,hewillfireyou. –Ifyouuseuniversalhashing,theonlywayhe cansucceedisbygettinglucky.Why?Hashing9
Collision R esolution: Chaining
•Withchaining,wesetupanarrayoflinks, indexedbythehashvalues,tolistsofitemswith thesamehashvalue. 134655 21
320 1 2 3 4 •Letnbethenumberofkeys,andmthesizeofthe hashtable.Definetheloadfactorα=n/m. •Successfulandunsuccessfulsearchingbothtake timeΘ(1+α)onaverage,assumingsimple uniformhashing. •Bysimpleuniformhashingwemeanthatakey hasequalprobabilitytohashintoanyofthem slots,independentoftheotherelementsofthe table. 10
esolution: Open Addr essing
Addressing,allelementsarestoredin severalthings isusedthanwithchainingsince thavetostorepointers. tablehasanabsolutesizelimitofm. lanningaheadisimportantwhenusing haveawaytostoremultipleelements samehashvalue. hashfunction,weneedtouseaprobe Thatis,asequenceofhashvalues. thesequenceonebyoneuntilwe position. wedothesamething,skipping donotmatchourkey.Hashing11
Pr obe S equences
•Wedefineourhashfunctionswithanextra parameter,theprobenumberi.Thusourhash functionslooklikeh(k,i). •Wecomputeh(k,0),h(k,1),...,h(k,i)until h(k,i)isempty. •Thehashfunctionhmustbesuchthattheprobe sequenceh(k,0),...h(k,m−1)isa permutationof0,1,...,m−1.Why? •Insertionandsearchingarefairlyquick,assuming agoodimplementation. •Deletioncanbeaproblembecausetheprobe sequencecanbecomplex. •Wewilldiscuss2waysofdefiningprobe sequences: –LinearProbing –DoubleHashingHashing12
Linear Pr obing
•Letgbeahashfunction. •Whenweuselinearprobing,theprobesequence iscomputedby h(k,i)=(g(k)+i)modm. •Whenweinsertanelement,westartwiththehash value,andproceedelementbyelementuntilwe findanemptyslot. •Forsearching,westartwiththehashvalueand proceedelementbyelementuntilwefindthekey wearelookingfor. •Example:Letg(k)=kmod13.Wewillinsert thefollowingkeysintothehashtable: 18,41,22,44,59,32,31,73 5432161211109870 •Problem:Thevaluesinthetabletendtocluster.13
Double H ashing
hashingweusetwohashfunctions. andh2behashfunctions. ourprobesequenceby h(k,i)=(h1(k)+i×h2(k))modm. pickm=2nforsomen,andensurethat isalwaysodd. hashingtendstodistributekeysmore thanlinearprobing.Hashing14
Double H ashing Example
•Leth1=kmod13 •Leth2=1+(kmod8),and •Letthehashtablehavesize13. •Thenourprobesequenceisdefinedby h(k,i)=(h1(k)+i×h2(k))mod13 =(kmod13+i(1+(kmod8)))mod13 •Insertthefollowingkeysintothetable: 18,41,22,44,59,32,31,73 5432161211109870Hashing15
Open Addr essing: P erf ormance
•Again,wesetα=n/m.Thisistheaverage numberofkeysperarrayindex. •Notethatα≤1foropenaddressing. •Weassumeagoodprobesequencehasbeenused. Thatis,foranykey,eachpermutationof 0,1...,m−1isequallylikelyasaprobe sequence. •Theaveragenumberofprobesforinsertionor unsuccessfulsearchisatmost1/(1−α). •Theaveragenumberofprobesforasuccessful searchisatmost 1 αln1 1−α+1 α. 16OA: SS OA: I/US Ch: S Ch: I
0 5 10 15 20 0 0.2
0.4 0.6
0.8 1
Hashing Performance: Chaining Verses Open Addressing
1 1+x 1/(1-x) (1/x)*log(1/(1-x))+1/x