    h = 0;
    for (p = (unsigned char *) str; *p != '\0'; p++)
        h = MULTIPLIER * h + *p;
    return h % NHASH;
The calculation uses unsigned characters because whether char is signed is not specified by C and C++, and we want the hash value to remain positive.
The hash function returns the result modulo the size of the array. If the hash function distributes key values uniformly, the precise array size doesn't matter. It's hard to be certain that a hash function is dependable, though, and even the best function may have trouble with some input sets, so it's wise to make the array size a prime number to give a bit of extra insurance by guaranteeing that the array size, the hash multiplier, and likely data values have no common factor.
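As a minimal sketch, the pieces discussed above can be assembled into a self-contained function. The values chosen here for MULTIPLIER and NHASH are assumptions for illustration (a small odd multiplier and a prime table size); the const qualifiers are likewise our addition.

```c
enum {
    MULTIPLIER = 31,    /* assumed value: a small odd multiplier */
    NHASH = 4093        /* assumed table size: a prime number */
};

/* hash: compute hash value of string, in the range 0..NHASH-1 */
unsigned int hash(const char *str)
{
    unsigned int h = 0;
    const unsigned char *p;

    /* unsigned char keeps bytes above 127 from contributing negative values */
    for (p = (const unsigned char *) str; *p != '\0'; p++)
        h = MULTIPLIER * h + *p;
    return h % NHASH;
}
```

With these values the empty string hashes to 0 and a one-character string hashes to its character code, which makes the function easy to check by hand.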
Experiments show that for a wide variety of strings it's hard to construct a hash function that does appreciably better than the one above, but it's easy to make one that does worse. An early release of Java had a hash function for strings that was more efficient if the string was long. The hash function saved time by examining only 8 or 9 characters at regular intervals throughout strings longer than 16 characters, starting at the beginning. Unfortunately, although the hash function was faster, it had bad statistical properties that canceled any performance gain. By skipping pieces of the string, it tended to miss the only distinguishing part. File names begin with long identical prefixes (the directory name) and may differ only in the last few characters (.java versus .class). URLs usually begin with http:// and end with .html, so they tend to differ only in the middle. The hash function would often examine only the non-varying part of the name, resulting in long hash chains that slowed down searching. The problem was resolved by replacing the hash with one equivalent to the one we have shown (with its own choice of multiplier), which examines every character of the string.
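The failure is easy to reproduce with a toy version of the idea. The sketch below is not the Java code; samplehash is a hypothetical function that, for strings longer than 16 characters, looks only at every (length/8)-th character, and fullhash is the every-character hash with assumed values for MULTIPLIER and NHASH.

```c
#include <string.h>

enum { MULTIPLIER = 31, NHASH = 4093 };    /* assumed values */

/* fullhash: examines every character of the string */
unsigned int fullhash(const char *s)
{
    unsigned int h = 0;

    for ( ; *s != '\0'; s++)
        h = MULTIPLIER * h + (unsigned char) *s;
    return h % NHASH;
}

/* samplehash: hypothetical sampling hash; long strings are examined
   only at regular intervals, starting at the beginning */
unsigned int samplehash(const char *s)
{
    size_t len = strlen(s);
    size_t step = (len > 16) ? len / 8 : 1;
    size_t i;
    unsigned int h = 0;

    for (i = 0; i < len; i += step)
        h = MULTIPLIER * h + (unsigned char) s[i];
    return h % NHASH;
}
```

Two file names such as /usr/local/project/src/main_a.c and /usr/local/project/src/main_b.c differ in a single character near the end. If that position falls between samples, as it does here, samplehash puts both names in the same bucket, while fullhash separates them.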
A hash function that's good for one input set (say, short variable names) might be poor for another, so a potential hash function should be tested on a variety of typical inputs. Does it hash short strings well? Long strings? Equal length strings with minor variations?
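One simple way to run such tests is to count how many keys land in each bucket and report the longest chain. The following is a rough sketch under our own assumptions: the names Hashfn, longest_chain, and firstchar are invented for illustration, and MULTIPLIER and NHASH have assumed values.

```c
#include <stdlib.h>

enum { MULTIPLIER = 31, NHASH = 4093 };    /* assumed values */

typedef unsigned int (*Hashfn)(const char *);

/* hash: multiplicative string hash that examines every character */
unsigned int hash(const char *str)
{
    unsigned int h = 0;

    for ( ; *str != '\0'; str++)
        h = MULTIPLIER * h + (unsigned char) *str;
    return h % NHASH;
}

/* firstchar: a deliberately poor hash that uses only the first character */
unsigned int firstchar(const char *str)
{
    return (unsigned char) str[0] % NHASH;
}

/* longest_chain: hash nkeys keys with fn; return the largest number of
   keys that share one bucket, or -1 if memory runs out */
int longest_chain(Hashfn fn, const char *keys[], int nkeys)
{
    int *count, i, max = 0;

    count = calloc(NHASH, sizeof count[0]);
    if (count == NULL)
        return -1;
    for (i = 0; i < nkeys; i++) {
        unsigned int b = fn(keys[i]) % NHASH;
        if (++count[b] > max)
            max = count[b];
    }
    free(count);
    return max;
}
```

Running longest_chain over several representative key sets (short names, long paths, strings that differ in one position) gives a quick indication of whether a candidate function spreads each kind of input.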
Strings aren't the only things we can hash. We could hash the three coordinates of a particle in a physical simulation, reducing the storage to a linear table (O(number of particles)) instead of a three-dimensional array (O(xsize × ysize × zsize)).
One remarkable use of hashing is Gerard Holzmann's Supertrace program for analyzing protocols and concurrent systems. Supertrace takes the full information for each possible state of the system under analysis and hashes the information to generate the address of a single bit in memory. If that bit is on, the state has been seen before; if not, it hasn't. Supertrace uses a hash table many megabytes long, but stores only a single bit in each bucket. There is no chaining; if two states collide by hashing to the same value, the program won't notice. Supertrace depends on the probability of collision being low (it doesn't need to be zero because Supertrace is probabilistic, not exact). The hash function is therefore particularly careful; it uses a cyclic redundancy check, a function that produces a thorough mix of the data.
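The one-bit-per-bucket idea can be sketched in a few lines. This is an illustration of the technique only, not Supertrace's code: the names Bitstate, bs_init, bs_seen, and bs_free are hypothetical, and statehash is a simple multiplicative hash standing in for the cyclic redundancy check that Supertrace uses.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Bitstate Bitstate;
struct Bitstate {
    unsigned char *bits;    /* one bit per bucket */
    size_t nbits;           /* number of buckets */
};

/* statehash: stand-in hash over the raw bytes of a state (not a CRC) */
static size_t statehash(const void *state, size_t len)
{
    const unsigned char *p = state;
    size_t h = 0;

    while (len-- > 0)
        h = 31 * h + *p++;
    return h;
}

/* bs_init: allocate a cleared bit table; return 1 on success, 0 on failure */
int bs_init(Bitstate *bs, size_t nbits)
{
    if (nbits == 0)
        return 0;
    bs->bits = calloc((nbits + CHAR_BIT - 1) / CHAR_BIT, 1);
    bs->nbits = nbits;
    return bs->bits != NULL;
}

/* bs_seen: return 1 if the state's bit is already on; otherwise turn it
   on and return 0.  Colliding states are indistinguishable. */
int bs_seen(Bitstate *bs, const void *state, size_t len)
{
    size_t i = statehash(state, len) % bs->nbits;
    unsigned char mask = (unsigned char) (1u << (i % CHAR_BIT));

    if (bs->bits[i / CHAR_BIT] & mask)
        return 1;
    bs->bits[i / CHAR_BIT] |= mask;
    return 0;
}

/* bs_free: release the bit table */
void bs_free(Bitstate *bs)
{
    free(bs->bits);
    bs->bits = NULL;
    bs->nbits = 0;
}
```

A state explorer would call bs_seen on each newly generated state and skip those for which it returns 1. A false "seen" answer caused by a collision means a state goes unexplored, which is why the quality of the hash matters so much here.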
Hash tables are excellent for symbol tables, since they provide expected O(1) access to any element. They do have a few limitations. If the hash function is poor or the table size is too small, the lists can grow long. Since the lists are unsorted, this leads to O(n) behavior. The elements are not directly accessible in sorted order, but it is easy to count them, allocate an array, fill it with pointers to the elements, and sort that. Still, when used properly, the constant-time lookup, insertion, and deletion properties of a hash table are unmatched by other techniques.
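The count, allocate, fill, and sort steps might look like the sketch below. The Nameval element type and the chained table layout are assumptions made for this example, as are the names sorted_elements and namecmp; the sort itself is the standard library's qsort.

```c
#include <stdlib.h>
#include <string.h>

enum { NHASH = 4093 };      /* assumed table size */

typedef struct Nameval Nameval;
struct Nameval {            /* assumed element type */
    const char *name;
    int value;
    Nameval *next;          /* next element in the same chain */
};

/* namecmp: compare two pointers to elements by name, for qsort */
static int namecmp(const void *a, const void *b)
{
    const Nameval *const *x = a;
    const Nameval *const *y = b;

    return strcmp((*x)->name, (*y)->name);
}

/* sorted_elements: return a malloc'd array of pointers to every element
   of tab, sorted by name; store the count in *n.  Returns NULL if the
   table is empty or memory runs out.  The caller frees the array. */
Nameval **sorted_elements(Nameval *tab[], int nhash, int *n)
{
    Nameval **v, *p;
    int i, k = 0;

    *n = 0;
    for (i = 0; i < nhash; i++)             /* count them */
        for (p = tab[i]; p != NULL; p = p->next)
            k++;
    if (k == 0)
        return NULL;
    v = malloc(k * sizeof v[0]);            /* allocate an array */
    if (v == NULL)
        return NULL;
    k = 0;
    for (i = 0; i < nhash; i++)             /* fill it with pointers */
        for (p = tab[i]; p != NULL; p = p->next)
            v[k++] = p;
    qsort(v, k, sizeof v[0], namecmp);      /* sort that */
    *n = k;
    return v;
}
```

The array holds pointers, not copies, so it remains valid only as long as the table's elements are not deleted.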
Exercise 2-14. Our hash function is an excellent general-purpose hash for strings.
Nonetheless, peculiar data might cause poor behavior. Construct a data set that causes our hash function to perform badly. Is it easier to find a bad set for different values of NHASH?
Exercise 2-15. Write a function to access the successive elements of the hash table in
unsorted order.
Exercise 2-16. Change lookup so that if the average list length becomes more than x,
the array is grown automatically by a factor of y and the hash table is rebuilt.
Exercise 2-17. Design a hash function for storing the coordinates of points in 2
dimensions. How easily does your function adapt to changes in the type of the coordinates, for example from integer to floating point or from Cartesian to polar coordinates, or to changes from 2 to higher dimensions?
2.10 Summary
There are several steps to choosing an algorithm. First, assess potential algorithms and data structures. Consider how much data the program is likely to process. If the problem involves modest amounts of data, choose simple techniques; if the data could grow, eliminate designs that will not scale up to large inputs. Then, use a library or language feature if you can. Failing that, write or borrow a short, simple, easy to understand implementation. Try it. If measurements prove it to be too slow, only then should you upgrade to a more advanced technique.
Although there are many data structures, some vital to good performance in special circumstances, most programs are based largely on arrays, lists, trees, and hash tables. Each of these supports a set of primitive operations, usually including: create a new element, find an element, add an element somewhere, perhaps delete an element, and apply some operation to all elements.
Each operation has an expected computation time that often determines how suitable this data type (or implementation) is for a particular application. Arrays support constant-time access to any element but do not grow or shrink gracefully. Lists adjust well to insertions and deletions, but take O(n) time to access random elements. Trees and hash tables provide a good compromise: rapid access to specific items combined with easy growth, so long as some balance criterion is maintained.
There are other more sophisticated data structures for specialized problems, but this basic set is sufficient to build the great majority of software.
Supplementary Reading
Bob Sedgewick's family of Algorithms books (Addison-Wesley) is an excellent place to find accessible treatments of a variety of useful algorithms. The third edition of Algorithms in C++ (1998) has a good discussion of hash functions and table sizes. Don Knuth's The Art of Computer Programming (Addison-Wesley) is the definitive source for rigorous analyses of many algorithms; Volume 3 (2nd Edition, 1998) covers sorting and searching.
Supertrace is described in Design and Validation of Computer Protocols by Gerard Holzmann (Prentice Hall, 1991).
Jon Bentley and Doug McIlroy describe the creation of a fast and robust quicksort in "Engineering a sort function," Software: Practice and Experience, 23, 1,