In this section, the main idea of StriDFA is demonstrated with an example. For the sake of simplicity, the example used here only considers string matching, which is a special case of regular expression matching. In the next chapter, a general solution will be presented to cover both string matching and regular expression matching.
Figure 3.1: Traditional DFA of patterns “reference” and “replacement”. Some default transitions are omitted for simplicity.
3.2.1 Traditional DFA in Multi-string Matching
Suppose we have two patterns to match: “reference” (P1) and “replacement” (P2). The conventional scheme is to first convert the patterns to a DFA, which is shown in Figure 3.1. The matching is performed by sending the input stream to the automaton byte by byte. If the DFA reaches any of its accept states, we say a match is found. The number of states visited during processing is therefore equal to the length of the input stream in bytes, and this number determines the time required for the matching process, because each state visit requires a memory access, which is a major bottleneck in today’s computer systems.
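The byte-by-byte scheme can be illustrated with a minimal sketch. It assumes an Aho-Corasick style construction as the conventional multi-string DFA (the text does not name a specific construction), and `build_dfa` and `match` are hypothetical helper names introduced only for illustration. The `visits` counter shows that exactly one state is visited per input byte.

```python
from collections import deque

def build_dfa(patterns):
    """Build a character DFA for several patterns (Aho-Corasick style, assumed)."""
    goto, accepts = [{}], [set()]            # state 0 is the start state
    for p in patterns:                       # 1) insert every pattern into a trie
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                accepts.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accepts[state].add(p)

    alphabet = {ch for p in patterns for ch in p}
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:                      # 2) complete the start state
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:                             # 3) complete the other states in BFS order
        state = queue.popleft()
        for ch in alphabet:
            if ch in goto[state]:            # trie edge: compute its failure state
                nxt = goto[state][ch]
                fail[nxt] = goto[fail[state]][ch]
                accepts[nxt] |= accepts[fail[nxt]]
                queue.append(nxt)
            else:                            # missing edge: copy from the failure state
                goto[state][ch] = goto[fail[state]][ch]
    return goto, accepts

def match(goto, accepts, text):
    """Feed `text` one character at a time; return (state visits, matches)."""
    state, visits, found = 0, 0, []
    for i, ch in enumerate(text):
        state = goto[state].get(ch, 0)       # characters in no pattern reset to the start
        visits += 1                          # one state visit (memory access) per byte
        found.extend((i, p) for p in sorted(accepts[state]))
    return visits, found

goto, accepts = build_dfa(["reference", "replacement"])
text = "referenceabcdreplacement"
visits, found = match(goto, accepts, text)
print(visits)   # 24, the length of the input
print(found)    # [(8, 'reference'), (23, 'replacement')]
```

Each match is reported as the 0-based index of the last matched character together with the pattern.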
In this scheme, I want to reduce the number of states to be visited during the matching process. If this objective is achieved, the number of memory accesses required for detecting a match can be reduced, and consequently, the pattern matching speed can be improved. One way to achieve this objective is to reduce the number of characters sent to the DFA.
3.2.2 Stride-based DFA
Instead of comparing the input stream with the patterns in the rule set character by character, I pick tag characters from the input stream and feed the “fingerprint” of these tag characters to the automaton for matching. Since the fingerprint is normally much shorter than the original input stream, the number of state visits required by the matching process can be significantly reduced.

Figure 3.2: Use tag to convert input stream into SL stream with tag ‘e’. (The figure shows the input stream referenceabcdreplacement and its SL stream $F_e(S)$.)
Here, I use the distance (the number of characters) between adjacent tags, denoted as “stride lengths” or step sizes, as the fingerprint. Stride lengths extracted from the rule set are compared with stride lengths extracted from the input strings for coarse-grained matching.
For example, in Figure 3.2, character ‘e’ is selected as the tag.
Definition 1. Stride Length (SL) is the distance between two adjacent tags. In our scheme, instead of feeding the automaton with single-byte characters, we feed the new SL automaton (StriDFA) with the distances, called stride lengths (SLs), between adjacent tags found in the input stream.
Definition 2. A convertor converts the original input stream to its corresponding SL stream.
Definition 3. Let $F_x(S)$ denote the SL stream of $S$ when using $x$ as the tag.
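Consider the example in Figure 3.2. The tag ‘e’ occurs at positions 2, 4, 6, 9, 15, 20 and 22 of the input stream S = referenceabcdreplacement, so the stream fed into the SL automaton (StriDFA) is $F_e(S) = \underline{2}\,\underline{2}\,\underline{3}\,\underline{6}\,\underline{5}\,\underline{2}$, where ‘e’ denotes the tag character in use. The underscore is used to indicate an SL, to distinguish it from a character.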
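The convertor of Definition 2 can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work; the function name `sl_stream` is hypothetical, and it assumes that characters before the first tag and after the last tag produce no SL, as in the example above.

```python
def sl_stream(stream, tag):
    """Return F_tag(stream): the distances between adjacent occurrences of `tag`."""
    strides = []
    last = None                      # index of the most recent tag, if any
    for i, ch in enumerate(stream):
        if ch == tag:
            if last is not None:
                strides.append(i - last)
            last = i
    return strides

s = "referenceabcdreplacement"
fe = sl_stream(s, "e")
print(fe)                # [2, 2, 3, 6, 5, 2]
print(len(s), len(fe))   # 24 6
```

The second line of output compares the number of symbols that a character DFA and an SL automaton would have to consume for this input.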
Figure 3.3: Convert patterns to the corresponding StriDFA. (Diagram labels: Rule Set; Convertors for tag ‘a’ and tag ‘b’; $F_a(P)$ and $F_b(P)$; StriDFA with tag ‘a’; StriDFA with tag ‘b’; StriDFA Matching Engines.)
Clearly, the volume of processing to be performed by the SL DFA is reduced compared with the original DFA. The original DFA needs to process 24 input characters, while the new SL DFA only needs to process 6 input SLs.
Of course, the DFA needs to be modified in order to handle the input “stride” (the new DFA variant is called StriDFA). The construction of StriDFA in this example is very simple: the patterns are first converted to SL sequences, and the SL sequences are then used to construct the StriDFA according to the traditional DFA construction method. Figure 3.3 describes how to construct StriDFA from the original rule set.
As shown in Figure 3.4, the SL sequences of patterns P1 and P2 with tag ‘e’ are $F_e(P_1) = \underline{2}\,\underline{2}\,\underline{3}$ and $F_e(P_2) = \underline{5}\,\underline{2}$. After obtaining the SLs, I can then use the classical DFA construction algorithm to build the StriDFA.
The original DFA and its corresponding StriDFA associated with patterns P1 and P2 are given in Figure 3.1 and Figure 3.4, respectively. Note that the transitions in the StriDFA are labeled with SLs rather than characters.
Figure 3.4: The sample StriDFA of patterns “reference” and “replacement” with character ‘e’ as tag. (The figure shows “reference” converted to the SLs 2 2 3, “replacement” converted to 5 2, and the input string referenceabcdreplacement matching P1 and P2.)
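The construction and matching steps for this example can be sketched as follows. The sketch assumes an Aho-Corasick style construction stands in for the "classical DFA construction algorithm", and `sl_stream`, `build_stridfa` and `run_stridfa` are hypothetical names introduced here. It also assumes every pattern contains the tag at least twice, since otherwise its SL sequence is empty.

```python
from collections import deque

def sl_stream(stream, tag):
    """Stride lengths between adjacent occurrences of `tag` in `stream`."""
    positions = [i for i, ch in enumerate(stream) if ch == tag]
    return [b - a for a, b in zip(positions, positions[1:])]

def build_stridfa(patterns, tag):
    """Build a DFA whose input symbols are stride lengths (assumed construction)."""
    goto, accepts = [{}], [set()]            # state 0 is the start state
    for p in patterns:
        seq = sl_stream(p, tag)
        if not seq:                          # fewer than two tags: nothing to match on
            raise ValueError(f"pattern {p!r} has no stride lengths for tag {tag!r}")
        state = 0
        for sl in seq:                       # insert the SL sequence into a trie
            if sl not in goto[state]:
                goto.append({})
                accepts.append(set())
                goto[state][sl] = len(goto) - 1
            state = goto[state][sl]
        accepts[state].add(p)

    symbols = {sl for edges in goto for sl in edges}
    fail = [0] * len(goto)
    queue = deque()
    for sl in symbols:                       # complete the start state
        if sl in goto[0]:
            queue.append(goto[0][sl])
        else:
            goto[0][sl] = 0
    while queue:                             # complete the other states in BFS order
        state = queue.popleft()
        for sl in symbols:
            if sl in goto[state]:
                nxt = goto[state][sl]
                fail[nxt] = goto[fail[state]][sl]
                accepts[nxt] |= accepts[fail[nxt]]
                queue.append(nxt)
            else:
                goto[state][sl] = goto[fail[state]][sl]
    return goto, accepts

def run_stridfa(goto, accepts, strides):
    """Feed an SL stream to the StriDFA; return (SL index, pattern) candidates."""
    state, hits = 0, []
    for i, sl in enumerate(strides):
        state = goto[state].get(sl, 0)       # SLs that occur in no pattern reset to the start
        hits.extend((i, p) for p in sorted(accepts[state]))
    return hits

goto, accepts = build_stridfa(["reference", "replacement"], "e")
strides = sl_stream("referenceabcdreplacement", "e")
hits = run_stridfa(goto, accepts, strides)
print(len(goto))   # 6 states
print(hits)        # [(2, 'reference'), (5, 'replacement')]
```

The automaton consumes six SLs instead of 24 characters. Its hits are candidates in the sense of the coarse-grained matching described earlier.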
3.2.3 Proof of Correctness
This section proves the correctness of StriDFA matching. Correctness here means that, for any given input stream, if the original DFA can be matched, the corresponding StriDFA can be matched too; equivalently, if the StriDFA cannot be matched, then the original DFA cannot be matched either.
Lemma 1. If StriDFA cannot be matched, then the corresponding original DFA cannot be matched either.
Proof. Denote $A$ = {DFA can be matched}, so that $\bar{A}$ = {DFA cannot be matched}, and $B$ = {StriDFA can be matched}, so that $\bar{B}$ = {StriDFA cannot be matched}. $A \rightarrow B$ is proved first. Assume the original DFA reaches the final state of pattern $P = p_1 p_2 \cdots p_m$ on input string $T$. Then there always exists an index $i$ in $T$ such that $t_i t_{i+1} \cdots t_{i+m-1} = p_1 p_2 \cdots p_m$. Specifically, this means $t_i = p_1$, $t_{i+1} = p_2$, $\ldots$, $t_{i+m-1} = p_m$.
According to the definition of $F_x(S)$ in Definition 3, $F_x(t_i t_{i+1} \cdots t_{i+m-1}) = F_x(p_1 p_2 \cdots p_m)$. Because the tags inside $t_i t_{i+1} \cdots t_{i+m-1}$ are consecutive tags of $T$, the stride length sequence of $P$ also occurs as a contiguous part of the stride length sequence of the input string $T$. In other words, if the original DFA can be matched, then the StriDFA can be matched by the input string $T$ ($A \rightarrow B$).

If a statement is true, its contrapositive is also true. The statement proved above is that if the original DFA can be matched, then the corresponding StriDFA can also be matched ($A \rightarrow B$). Its contrapositive, $\bar{B} \rightarrow \bar{A}$, is therefore also true. This proves that if StriDFA cannot be matched, then the corresponding original DFA cannot be matched either.

Figure 3.5: The overall structure of StriDFA. (Diagram labels: Input Buffer; Input Stream S; Convertors for tag ‘a’ and tag ‘b’; SL Streams $F_a(S)$ and $F_b(S)$; Rule Set with $F_a(P)$ and $F_b(P)$; StriDFA Matching Engine with StriDFA with ‘a’ and StriDFA with ‘b’; Verification Module; Normal Traffic; Malicious Traffic.)
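The direction $A \rightarrow B$ of Lemma 1 can also be checked empirically. The following sketch is a randomized sanity check under stated assumptions, not a replacement for the proof: it embeds a random pattern in a random text, so the original DFA matches by construction, and confirms that the SL sequence of the pattern occurs as a contiguous run in the SL stream of the text. The names `sl_stream`, `contains_run` and `check_lemma`, the three-letter alphabet and the trial count are all illustrative choices.

```python
import random

def sl_stream(stream, tag):
    """Stride lengths between adjacent occurrences of `tag` in `stream`."""
    positions = [i for i, ch in enumerate(stream) if ch == tag]
    return [b - a for a, b in zip(positions, positions[1:])]

def contains_run(haystack, needle):
    """True if the list `needle` occurs as a contiguous run inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

def random_string(rng, alphabet, max_len):
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))

def check_lemma(trials=2000, seed=0):
    """Check A -> B on random inputs: a text containing P also contains F_x(P)."""
    rng = random.Random(seed)
    alphabet = "abe"
    for _ in range(trials):
        pattern = random_string(rng, alphabet, 6)
        text = random_string(rng, alphabet, 8) + pattern + random_string(rng, alphabet, 8)
        for tag in alphabet:
            if not contains_run(sl_stream(text, tag), sl_stream(pattern, tag)):
                return False                 # would be a counterexample to A -> B
    return True

print(check_lemma())   # True

# The converse is not claimed by the lemma: this text has the same SL
# sequence as "reference" with tag 'e' but does not contain the pattern.
candidate = "aeaeaeaae"
print(sl_stream(candidate, "e") == sl_stream("reference", "e"))   # True
print("reference" in candidate)                                    # False
```

The last two lines illustrate that a StriDFA match alone does not imply a match of the original DFA, which is consistent with the scheme being described as coarse-grained matching and with the verification module shown in Figure 3.5.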