Optimizations - Algorithms and Architectures for Network Search Processors

[Plot for Figure 3.4. Axes: size of embedded memory (MBits) and expected # of hash probes per lookup; one curve each for 100000, 150000, 200000, and 250000 prefixes.]

Figure 3.4: The average number of hash probes per lookup, τ_avg1, versus the total embedded memory size, M , for various values of total prefixes, N , using a basic configuration for IPv4 with 32 Bloom filters.

perform 166 million lookups per second. In the next section, we describe optimizations to this basic configuration that reduce the worst case memory accesses.

3.4 Optimizations

The worst case memory accesses depend on the number of Bloom filters. Therefore, the number of filters must be reduced to improve the performance. We describe two enhancements to this basic configuration to reduce the number of filters. The first enhancement uses a direct indexing array for prefixes of lengths below a certain length. The second enhancement uses Controlled Prefix Expansion (CPE) [32] to reduce the number of distinct prefix lengths and Bloom filters².

² These optimizations were suggested by David Taylor.

3.4.1 Direct Indexing

Lookups for short prefixes can be performed simply by using the prefix as an address into an array whose locations store the next hop information. To look up prefixes of length i bits, we need to maintain an array of 2ⁱ entries. Indeed, if memory were plentiful, we could maintain an array covering all 2^W possible values of an IP address. This is expensive and impractical for W = 32 (IPv4) and beyond technological reach for W = 128 (IPv6). However, direct indexing can still be used for short prefixes, for example up to 20 bits, since these can be accommodated by about a million array locations. Supporting a million entries is certainly feasible with today’s SRAM.

A straightforward indexing scheme would require an array for each prefix length: a 2-entry array for prefix length 1, a 4-entry array for prefix length 2, an 8-entry array for prefix length 3, and so on. These arrays would have to be kept in separate memory blocks to allow parallel lookups. This requirement can be avoided if we expand all the shorter prefixes to a fixed prefix length and use a single array for it. For instance, if we decide to use a direct lookup array for prefixes of length up to 8, then given a shorter prefix, say 101*, we expand it by enumerating all the prefixes from 10100000* to 10111111* and associate the same forwarding information with each of them. This increases the number of prefixes in the table but does not require more memory, since the direct lookup array already has space for all 2⁸ possible prefixes. Therefore, expanding any number of prefixes shorter than 8 bits never requires more than 2⁸ entries.

It should be noted that two prefixes of different lengths, when expanded to this upper limit, might need to share the same locations. For instance, if the table contains both 101* and 1010*, their expansions overlap in the entries ranging from 10100000 to 10101111. In that case, the forwarding information of the longer original prefix, 1010*, is associated with all the overlapping entries.
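The expansion into a single direct lookup array can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `build_direct_array`, the representation of prefixes as bit strings, and the next hop labels "A" and "B" are all hypothetical. Shorter prefixes are written first so that a longer prefix overwrites the slots it shares with a shorter one.

```python
def build_direct_array(prefixes, width=8):
    """Expand (bit_string, next_hop) prefixes into a 2**width direct lookup array.

    Hypothetical helper for illustration; prefixes are strings such as "101".
    """
    table = [None] * (1 << width)
    # Write shorter prefixes first so longer prefixes win on shared slots.
    for bits, next_hop in sorted(prefixes, key=lambda p: len(p[0])):
        if len(bits) > width:
            raise ValueError("prefix longer than the direct lookup width")
        pad = width - len(bits)
        # Lowest array index covered by this prefix (0 for the empty prefix).
        base = (int(bits, 2) << pad) if bits else 0
        for offset in range(1 << pad):
            table[base + offset] = next_hop
    return table


# The example from the text: 101* and 1010* expanded to 8 bits.
table = build_direct_array([("101", "A"), ("1010", "B")])
```

With these two prefixes, entries 10100000 through 10101111 hold the next hop of 1010*, entries 10110000 through 10111111 hold the next hop of 101*, and every other entry stays empty.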

In practice, it is feasible to use a direct lookup array for prefixes up to length 20. An array of 2²⁰ = 1M locations can be used, with each location containing the next hop information. Assuming that the forwarding information is an IP address requiring four bytes, the direct lookup array needs 4 MB of space. Commodity SRAM chips can be used to implement this array. A drawback is that incremental updates take longer, since multiple array entries need to be populated or deleted when a shorter prefix is inserted or deleted.

After using the direct lookup for prefix lengths 20 or less, we are left with only 12 Bloom filters corresponding to prefix lengths from 21 to 32. Hence, the average memory accesses per lookup are

$$\tau_{avg2} = 12f + 1 = 12\left(\frac{1}{2}\right)^{\frac{M \ln 2}{N - \sum_{i=1}^{20} n_i}} + 1 \qquad (3.11)$$

and the worst case is reduced to

$$\tau_{worst2} = 13 \qquad (3.12)$$

Thus, at the cost of using more off-chip memory for the direct indexing array, we can reduce both the average and the worst case number of memory accesses per lookup. For the same amount of on-chip memory, this hybrid scheme exhibits better performance, since the memory released by the 20 eliminated Bloom filters can now be used by the remaining 12 Bloom filters to reduce their false positive rates. In order to evaluate the performance of our system after this optimization, a specific prefix length distribution must be taken into account. We collected 15 IPv4 BGP tables from [4]. The average prefix length distribution for these BGP tables is shown in Figure 3.5.

From the distribution analysis, it was discovered that 24.6% of the prefixes range from 1 to 20 bits. With the direct lookup array for these prefixes, a significant portion of the table is eliminated from the Bloom filters. By substituting $\sum_{i=1}^{20} n_i = 0.246N$ in Equation 3.11, the resulting performance can be evaluated numerically as plotted in Figure 3.6. A comparison of Figure 3.4 and Figure 3.6 shows that the average memory accesses are reduced significantly for the same amount of memory with the direct lookup array.
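The numerical evaluation of Equation 3.11 can be sketched in a few lines. The function name `avg_probes_direct`, its parameter names, and the interpretation of 1 MBit as 10⁶ bits are assumptions made for this illustration; the 24.6% fraction and the 12 remaining filters come from the text.

```python
import math


def avg_probes_direct(m_bits, n_prefixes, short_fraction=0.246, filters=12):
    """Evaluate Equation 3.11: expected hash probes per lookup.

    m_bits         -- total embedded memory M, in bits
    n_prefixes     -- total number of prefixes N
    short_fraction -- share of prefixes of length 1..20 (direct lookup array)
    filters        -- Bloom filters that remain (lengths 21..32)
    """
    remaining = n_prefixes * (1.0 - short_fraction)
    # False positive probability of an optimally configured Bloom filter.
    f = 0.5 ** (m_bits * math.log(2) / remaining)
    return filters * f + 1


# Assumed operating point: M = 2 MBits (taken as 2e6 bits), N = 200000 prefixes.
probes = avg_probes_direct(2_000_000, 200_000)
```

At this assumed operating point the expected number of probes comes out at roughly 1.02, while the worst case remains 13 regardless of M.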

[Plot for Figure 3.5. Axes: prefix length and percentage of prefixes.]

Figure 3.5: Average prefix length distribution for IPv4 BGP table snapshots.

[Plot for Figure 3.6. Axes: size of embedded memory (MBits) and expected # of hash probes per lookup; one curve each for 100000, 150000, 200000, and 250000 prefixes.]

Figure 3.6: Average number of hash probes per lookup, τ_avg2, versus total embedded memory size, M , for various values of total prefixes, N , using a direct lookup array for prefix lengths 1 . . . 20 and 12 Bloom filters for prefix lengths 21 . . . 32.

3.4.2 Controlled Prefix Expansion

Further Bloom filters can be eliminated by using CPE. An analysis of the prefix tables shows that there are fewer prefixes of lengths 21 to 23 bits than 24-bit prefixes. Also, there are very few prefixes of lengths 25 to 31. Therefore, all the 21- to 23-bit prefixes can be expanded to 24-bit prefixes and all the 25- to 31-bit prefixes to 32-bit prefixes.

While this expansion results in more prefixes, the number of Bloom filters can now be reduced from 12 to 2, one corresponding to 24-bit prefixes and the other to 32-bit prefixes. Let α denote the expansion factor for prefixes of lengths 21 to 24, that is, the ratio of total prefixes after expansion to total prefixes before expansion. Likewise, let β denote the expansion factor for prefixes of lengths 25 to 32. Our experiments with real prefix tables show that α ≈ 1.8 and β ≈ 50. Although β is large, the overall number of prefixes of lengths 25 to 32 is very small, so the expansion is tolerable. With CPE, the new equations are
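The expansion step and the expansion factor can be illustrated with a short sketch. The names `cpe_expand` and `expansion_factor` are hypothetical, prefixes are represented as bit strings, and the length histogram in the usage lines is made up purely for illustration (it is not taken from the BGP tables). The factor computed here simply counts expanded entries and ignores expanded prefixes that collide with existing ones, so it is an upper bound.

```python
def cpe_expand(bits):
    """Expand a 21- to 32-bit prefix to the next target length (24 or 32)."""
    if not 21 <= len(bits) <= 32:
        raise ValueError("only prefixes of length 21..32 are expanded")
    target = 24 if len(bits) <= 24 else 32
    pad = target - len(bits)
    if pad == 0:
        return [bits]  # already at a target length
    # Enumerate every completion of the missing low-order bits.
    return [bits + format(i, "0{}b".format(pad)) for i in range(1 << pad)]


def expansion_factor(length_counts, target):
    """Ratio of prefixes after expansion to prefixes before expansion.

    length_counts maps prefix length -> number of prefixes of that length.
    Collisions between expanded prefixes are ignored (upper bound).
    """
    before = sum(length_counts.values())
    after = sum(n << (target - length) for length, n in length_counts.items())
    return after / before


expanded = cpe_expand("1" * 22)  # a 22-bit prefix becomes four 24-bit prefixes
alpha_example = expansion_factor({21: 1, 22: 1, 23: 1, 24: 1}, target=24)
```

In the made-up histogram, one prefix each of lengths 21, 22, 23, and 24 expands to 8 + 4 + 2 + 1 = 15 entries, giving a factor of 3.75. Real tables are dominated by 24-bit prefixes, which is why the measured α is much closer to 1.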

$$\tau_{avg3} = 2f + 1 = 2\left(\frac{1}{2}\right)^{\frac{M \ln 2}{N - \sum_{i=1}^{20} n_i + \alpha \sum_{i=21}^{23} n_i + \beta \sum_{i=25}^{31} n_i}} + 1 \qquad (3.13)$$

$$\tau_{worst3} = 3 \qquad (3.14)$$

The performance is plotted in Figure 3.7. The expansion overhead increases the average memory accesses slightly, since there are more prefixes to be handled using the same amount of memory. At the same time, great savings are achieved in the worst case memory accesses: at most, there are two Bloom filter matches and a final direct lookup access. Again, CPE increases the cost of incremental updates. Any prefix with a length from 21 to 23 or from 25 to 31 now needs to be expanded into multiple 24-bit or 32-bit prefixes and requires as many insertions in the table.
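The update cost mentioned above follows directly from the prefix length. The helper below, whose name `insertions_needed` is hypothetical, is a small sketch that counts the table insertions a single prefix update requires under the two target lengths used in this section.

```python
def insertions_needed(length):
    """Table insertions required to add one prefix of the given length under CPE."""
    if 21 <= length <= 24:
        target = 24
    elif 25 <= length <= 32:
        target = 32
    else:
        raise ValueError("lengths 1..20 are handled by the direct lookup array")
    # Each missing bit doubles the number of expanded prefixes.
    return 1 << (target - length)


update_costs = {length: insertions_needed(length) for length in range(21, 33)}
```

A 21-bit prefix therefore costs 8 insertions and a 25-bit prefix costs 128, whereas 24-bit and 32-bit prefixes still cost one insertion each.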
