I
magine the desire to store Boolean information in a bit array—a very simple premise. Simply assign each element in the bit array to a specific meaning, and then assign it a value. In this scenario, it takes 1 bit in the array to store 1 bit of stored information. The bit array faithfully represents its relative value with 100-percent accuracy. This, of course, works best when the stored data is array oriented such as a transient over time or space. However, what if the data is not a linear transient-oriented data set?Bloom's Way
In 1970, Burton H. Bloom published a simple and clever algorithm [Bloom70] in the
"Communications of the ACM." In his publication, Bloom suggests using a "Hash Coding with Allowable Errors" algorithm to help word processors perform capitaliza-tion or hyphenacapitaliza-tion on a document. This algorithm would use less space and be faster than a conventional one-to-one mapping algorithm. Using this example, a majority of words (90 percent, for example) could be checked using a simple rule, while the smaller minority set could be solved with an exception list used to catch the instances where the algorithm would report a word as simply solvable when it was not. Bloom's motivation was to reduce the time it took to look up data from a slow storage device.
Possible Scenarios
A Bloom Filter can reduce the time it takes to compute a relatively expensive and rou-tinely executed computation by storing a true Boolean value from a previously executed computation. Consider the following cases where we'd like to improve performance:
• Determine if a polygon is probably visible from an octree node.
• Determine if an object probably collides at a coordinate.
• Determine if a ray cast probably intersects an object at a coordinate.
133
All of these cases fit into a general scenario. Each case involves an expensive com-putation (CPU, network, or other resource) where the result is a Boolean (usually false) answer. It is important to note that that the word probably is used in each case because a Bloom Filter is guaranteed to be 100-percent accurate if the Bloom Filter test returns a false (miss), but is, at best, only probably true if the Bloom Filter returns true (hit).
A Bloom Filter can store the true result of any function. Usually, the function parameter is represented as a pointer to a byte array. If we wish to store the result of a function that uses multiple parameters, we can concatenate the parameters into a sin-gle function parameter. In cases where 100-percent accuracy is needed, we must com-pute the original expensive function to determine the absolute result of the expensive function, if a Bloom Filter test returns true.
How It Works
There are two primary functions in a Bloom Filter: a function for storing the Boolean true value returned from an expensive function, and a function for testing for a previ-ously stored Boolean true value. The storing function will accept input in any form and modify the Bloom Filter Array accordingly. The testing function will accept input in the same form as the storing function and return a Boolean value. If the testing function returns false, it is guaranteed that the input was never previously stored using the storing function. If the function returns true, it is likely that the input was previ-ously stored using the storing function. A false positive is a possible result from the test. If 100-percent accuracy is desired, perform the original expensive function to determine the absolute value. A conventional Bloom Filter is additive, so it can only store additional Boolean true results from an expensive function and cannot remove previously stored values.
Definitions
The high-quality operation of a Bloom Filter requires a high-quality hash function that is sometimes referred to as a message digest algorithm. Any high-quality hash function will work, but I recommend using the MD5 message digest algorithm [RSA01] from RSA Security, Inc., which is available in source code on the Net, and is also documented in RFC 1321. The MD5 hash function takes N bytes from a byte array and produces a 16-byte (128-bit) return value. This return value is a hash of the input, which means if any of the bits in the input change (even in the slightest), the return value will be changed drastically. The return of the hash function, in Bloom terminology, is called the Bloom Filter Key.
Bloom Filter Indexes are obtained by breaking the Bloom Filter Key into blocks of a designated bit size. If we choose a Bloom Filter Index bit size of 16 bits, a 128-bit Bloom Filter Key can be broken into eight complete 16-bit segments. If there are remaining bits left over from breaking the Key into complete segments, they are discarded.
The number of Bloom Filter Phases used in a Bloom Filter is the number of Bloom Filter Indexes used to store the Boolean value from the expensive function. For example, three phases might be used from a 128-bit key using a Bloom Filter Index bit size of 16 bits. The remaining five indexes will be discarded, in this example.
A Bloom Filter Array is used to store the expensive function's Boolean value. For example, if the Bloom Filter Index bit size is 16 bits, the Bloom Filter Array will be 216 bits long, or 64K bits (8K bytes). The larger the array, the more accurate the Bloom Filter test.
The Bloom Filter Saturation of the Bloom Filter Array is the percentage of bits set to true in the bit array. A Bloom Filter Array is optimal when saturation is 50 percent, or half of the bits are set and half are not.
Example 1
For an example, we will store the function parameter ("Mikano is in the park") using three phases with an index bit size of 16 bits into an array 64k bits long (8k bytes). In this instance, the expensive function was used to determine if Mikano was truly in the park and the result was yes (true). Although we used a string variable, in this case, any variable format will work. The format of the stored expensive function parameter data is independent of the Bloom Filter performance, accuracy, or memory usage.
First, the hash function is computed from the expensive function parameter data.
Let's assume that the hash function returned the 128-bit value Oxl0027AB30001BF 7877AB34D976A09667. The first three segments of 16-bit indexes will be 0x1002, 0x7AB3, and 0x0001. The remaining segments are ignored.
The Bloom Filter Array starts out reset (all false bits), before we begin to populate the bit array with data. Then, for each of these indexes, we will set the respective bit index in the Bloom Filter Array to true regardless of its previous value. As the array becomes populated, sometimes we will set a bit to true that has already been set to true. This is the origin of the possible false positive result when testing the Bloom Fil-ter Array (Figure 1.20.1).
When we wish to examine the Bloom Filter Array to determine if there was a pre-viously stored expensive function parameter, we proceed in almost the same steps as a store, except that the bits are read from the Bloom Filter Array instead of written to them. If any of the read bits are false, then the expensive function parameter was absolutely never previously stored in the Bloom Filter Array. If all of the bits are true, then the expensive function parameter was likely previously stored in the Array. In the case of a true result, calculate the original expensive function to accurately determine the Boolean value (Figure 1.20.2).
Tuning the Bloom Filter
Tuning the Bloom Filter involves determining the number of phases and the bit size of the indexes. Both of these variables can be modified to change the accuracy and capacity of the Bloom Filter. Generally speaking, the larger the size of the bit array
void store_bloom_data("Mikano is in the park") 128 Bloom Key divided into 8 16-bit segments
3 phase, 16-bit (8K Byte) Bloom Filter
boolean test_bloom_data("Mikano is in the park") read
Bit Value
boolean test_bloom_data("Mikano is in the office ")
"Mikanq is in
(not set so return false)
r-J
•w
\s
•w
If OxFFFC was also set, then a false positive would
be returned. FIGURE 1.20.1 Flow of a Bloom Filter.
// returns a pointer to 16 bytes of data that represent the hash void * compute_hash ( pData, nDataLength );
// returns the integer value for the bits at nlndex for nBitLength long int get_index_value( void * pData, int nBitlndex, int nBitLength );
// tests a bit in the Bloom Filter Array.
// returns true if set otherwise returns false boolean is_bit_index_set( int nlndexValue );
// sets a bit in the Bloom Filter Array void set_bit_index( int nlndexValue );
void store_bloom_data( void * pData, int nDataLength ) {
void *pHash;
int nPhases = 3, nPhaselndex = 0, nBitlndexLength = 16;
// returns pointer to 16 bytes of memory pHash = compute_hash( pData, nDataLength );
// now set each bit
while { nPhaselndex < m nPhases )
Theoretically, a different input parameter could return the same
value but that is unprobable.
Either way, the algorithm will still work.
nlndexValue = get_index_value( pHash, nPhaselndex, nBitlndexLength );
// if bit is not set, we have a miss so return false set_bit_index( nlndexValue ) ;
nPhase!ndex++;
boolean test_bloom_data( void * pData, int nDataLength ) void *pHash;
int nPhases = 3, nPhaselndex = 0, nBitlndexLength =. 16;
// returns pointer to 16 bytes of memory pHash = compute_hash( pData, nDataLength ); 4-// now test each bit
while ( nPhaselndex < m nPhases )
compute_hash will always return the same 16 bytes of data when called with the same input parameters.
nlndexValue = get_index_value( pHash, nPhaselndex, nBitlndexLength // if bit is not set, we have a miss so return false
if ( !is_bit_index_set( nlndexValue ) ) return( false );
nPhase!ndex++; *s.
// all bits are set so we have a probably hit.
return( true );
Return false as soon as we find a false bit. At this point,
the expensive function has definitely not been previously
stored.
FIGURE 1.20.2 Basic use of a Bloom Filter.
(N) and the more phases, the less likely a false positive response will occur. Bloom asserted that the optimum performance of this algorithm occurs when saturation of the bit array is 50 percent. Statistically, the chance of a false positive can be deter-mined by taking the array saturation and raising it to the power of the number of phases. Other equations are available to tune the Bloom filter algorithm.
The equation to calculate the percentage of false positives is:
percent_false_pdsitive = saturationnumb"-°f-fhases
or expressed as a function of percent_false_positive:
number_of_j>hases = Logsaturation(percentjalse_fositive)
By assuming that the Bloom Filter Array is operating at optimum capacity of 50-percent saturation, Table 1 .20. 1 can be computed from the preceding formulas.
For example, if we want the false positive rate below half a percent (0.5 percent), eight phases must be used, which will return a worst-case scenario of 0.39-percent false positives.
Next, we calculate the Bloom Filter Array bit size.
array_bit_size = ( number_of_phases * max_stored_input
The array_bit_size is usually rounded up to the nearest value where array_bit_size can be expressed as 2 to the power of an integer.
array _bit_size = .2*
Finally, compute the index_bit_size from the array_bit_size.
array <_bit_size = 2>ndex-bit-"z*
Table 1.20.1 Percentage of False Positives Based on Number of Phases Used
percent_false_positive number_of_phases 50.00%
25.00%
12.50%
6.13%
3.13%
1.56%
078%
0.39%
Example 2
Suppose we want to store a maximum of 9000 expensive function parameters with at least 95-percent accuracy when the Bloom Filter Array test returns true. From Table 1.20.1, we can determine that five phases will be necessary to obtain an accu-racy of equal to or greater than 95 percent and a false positive of less than or equal to 5 percent.
5 phases * 9000 expensive function parameters / -ln(0.5) = 64,921 bits
Rounding up to the nearest 2n gives us 64K bits (8K bytes), and because 216 = 64K, the index_bit_size will be 16 bits.
Final Notes
One way to improve performance is to use an exception list to prevent executing the expensive function, as Bloom did in his algorithm. An exception list contains all of the false positive cases that can be returned from testing a Bloom Filter. This can be computed at parameter storage or dynamically when false positives are detected (Fig-ure 1.20.3).
Another way to improve performance is to dynamically build a Bloom Filter Array. If the range of expensive function parameters is too great, Bloom Filters can be calculated dynamically and optimized for repetitive calls to test the bit array. By dynamically building a Bloom Filter Array, the commonly tested expensive function parameters are calculated once, and untested function parameters do not waste space in the bit array.
Standard Bloom Filter Test Code
if ( test_bloom_data(c ) ) boolean bSuccess = false;
if ( in_exception_list ( c ) ) return ( bSuccess ) ; f*
bSuccess = expensive_f unction (c ) ; if ( ibSuccess ) add_to_excepti return ( bSuccess ) ;
else
if ( expensive_f unction (c ) ) store_bloom_data ( c ) ; return true;
return false;
/ i -.
Optional Code
Exception List Test Dynamically computed Exception List
Dynamically computed Bloom Filter
FIGURE 1.20.3 Bloom Filter configurations.
Here are some interesting Bloom Filter characteristics:
• Two Bloom Filter Arrays can be merged together by bitwise ORing them.
• Bloom Filter Arrays can be shared among parallel clients.
• Optimized Bloom Filter Arrays are not compressible.
• Underpopulated Arrays are very compressible.
• Memory corruption in the array can be mended by setting unknown bits to true.
Conclusion
Bloom Filters offer a method of improving performance of repeatedly called expensive functions at the expense of memory. While this method has been documented for a long time, it remains a relatively unused technique, although exceptions exist, such as Bloom Filter usage in the very popular Web-caching program Squid (www.squid-cache.org/) by Duane Wessels. Adding a Bloom Filter algorithm to a program can usually be done in less that 20K bytes of code. As with most performance-enhancing tricks, it is a good idea to add Bloom Filters to a project during the optimization stage, after the main functionality is finished.
References
[BeachOl] Beach Software, "Bloom Filters," available online at http://
beachsoftware.com/bloom/, May 10, 2000.
[RSA01] RSA Security, "What Are MD2, MD4, and MD5," available online at www.rsasecurity.com/rsalabs/faq/3-6-6.html, March 4, 2001.
[FlipcodeOl] Flipcode, "Coding Bloom Filters," available online at "www.flipcode .com/tutorials/tut_bloomfilter.shtml, September 11, 2000.
[Bloom70] Bloom, Burton H., "Space/Time Trade-Offs in Hash Coding with Allow-able Errors," Communications of the ACM, Vol. 13, No.7 (ACM July 1970): pp.
422-426.