Pattern Matching
APPROXIMATE USAGE STRING MATCHING
The usage tree alone cannot find all illegal usage because it cannot cover every usage pattern. This research applies two methods of approximate usage matching to mobile data protection, neither of which requires all usage patterns to be stored: (i) approximate string/pattern matching and (ii) finite usage automata. The former method is covered in this section, and the latter is covered in the next section.
Approximate String/Pattern Matching
The string matching problem, given strings P and X, examines the text X for an occurrence of the pattern P as a substring, namely, whether the text X can be written as X = YPY’, where Y and Y’ are strings. String matching is an important component of many problems, including text editing, bibliographic retrieval, and symbol manipulation.
Several algorithms for this problem have appeared in the literature (Baeza-Yates & Gonnet, 1992). In some instances, however, the pattern and/or the text are not exact. For example, a name may be misspelled in the text. The approximate string matching problem is to find all substrings in X that are close to P under some measure of closeness.
Figure 5. A sample simplified usage tree
The most common measure of closeness is the edit distance: the task is to determine whether X contains a substring P’ that lies within a given edit distance of P. The editing operations may include (i) insertion, (ii) replacement, (iii) deletion, (iv) swap (interchanging any two adjacent characters), and (v) regular expression matching. Some approximate string-matching algorithms can be found in the related literature (Wu & Manber, 1992).
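As an illustration of matching under an edit-distance bound, the following is a minimal sketch of the classical dynamic-programming approach, in which the first row of the table is kept at zero so that a match may start anywhere in the text. It is not the algorithm of Wu and Manber (1992). The function name approx_find and the parameter k are hypothetical, and only insertion, deletion, and replacement are counted as edits (swap and regular expression matching are left out as a simplifying assumption).

```python
def approx_find(pattern, text, k):
    """Return every end index j (exclusive) such that some substring of
    text ending at j is within edit distance k of pattern.
    Assumption: edits are insertion, deletion, and replacement only."""
    m = len(pattern)
    # col[i] = fewest edits turning pattern[:i] into a substring of the
    # text that ends at the current text position.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # col[i-1] from the previous text position
        # col[0] stays 0: a match may begin at any text position.
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,  # match or replacement
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    # The pattern "survey" occurs within 2 edits in "surgery".
    print(approx_find("survey", "surgery", 2))  # [5, 6, 7]
```

The sketch keeps a single column of the table, so it uses O(|P|) memory and O(|P||X|) time.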
Longest Approximate Common Subsequences
Finding a longest common subsequence (LCS) (Hirschberg, 1977) of two strings occurs in a number of computing applications. A longest common subsequence is mainly used to measure the discrepancies between two strings. An LCS, however, does not always reveal the degree of difference between two strings that some problems require. For example, if s0 = ⟨a, b⟩ and s1 = ⟨b, a⟩, an LCS contains only one symbol, which does not show that the two strings consist of the same symbols, although not in the same order. Approximating an LCS better characterizes the discrepancies between two strings.
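To make the limitation concrete, the following is a minimal sketch of the textbook quadratic-space dynamic program for an LCS. It is not Hirschberg's linear-space algorithm, and the function name lcs is hypothetical. For two strings that hold the same two symbols in opposite order, such as "ab" and "ba", it returns a subsequence of length one.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y as a list."""
    n, m = len(x), len(y)
    # table[i][j] = length of an LCS of x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover one LCS.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


if __name__ == "__main__":
    print(len(lcs("ab", "ba")))  # 1
```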
The longest approximate common subsequence (LACS) problem seeks a maximum-gain approximate common subsequence of two strings (Hu, Ritter, & Schmalz, 1998). An approximate subsequence of a string X is a string edited from a subsequence of X. The only editing operation allowed here is an adjacent symbol interchange.
String Z is an approximate common subsequence of two strings X and Y if Z is an approximate subsequence of both X and Y. The gain function g, which is described later, assigns a nonnegative real number to each subsequence. Formally, the LACS problem is defined as follows: Given two strings X and Y, a weight Wm>0 for a symbol in an approximate common subsequence, and a weight Ws≤0 for an adjacent symbol interchange operation, a string Z is a longest approximate common subsequence of X and Y if Z satisfies the following two conditions:
1. Z is an approximate common subsequence of X and Y, and
2. The gain g(X,Y,Z,Wm,Ws) = |Z|Wm + δ(X,Z)Ws + δ(Y,Z)Ws is a maximum among all approximate common subsequences of X and Y, where δ(X,Z) is the minimum edit distance from a subsequence of X to Z, and δ(Y,Z) is the minimum edit distance from a subsequence of Y to Z.
A string Z is said to be at an edit distance k to a string Z’ if Z can be transformed to be equal to Z’ with a minimum sequence of k adjacent symbol interchanges. The following is an LACS example.
Let X = ⟨B, A, C, E, A, B⟩ and Y = ⟨A, C, D, B, B, A⟩. An LACS of these strings can be depicted another way, known as a trace (Wagner, 1975).
Diagrammatically aligning the input strings X and Y and drawing lines from symbols in X to their matches in Y provides the trace of X and Y. Figure 6 illustrates the above example through a trace.
In an LACSi trace, each line is allowed to have a maximum of i line-crossings, i.e. the symbol touched by the line may make no more than i adjacent symbol interchanges. The total number of line-crossings in a trace is δ(X, Z) + δ(Y, Z).
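The relation between line-crossings and gain can be sketched in code. The example below assumes a trace is given as a list of (i, j) index pairs, each linking X[i] to Y[j]. It counts the crossings and evaluates |Z|Wm + (δ(X, Z) + δ(Y, Z))Ws, using the total number of crossings for the sum of the two distances. The function names, the sample trace, and the weights Wm = 2 and Ws = -1 are hypothetical. The trace is not claimed to be the one in Figure 6 or to have maximum gain.

```python
from itertools import combinations


def trace_stats(lines):
    """lines: list of (i, j) pairs, one per trace line from X[i] to Y[j].
    Returns (total crossings, most crossings on any single line)."""
    per_line = [0] * len(lines)
    total = 0
    for (a, (i1, j1)), (b, (i2, j2)) in combinations(enumerate(lines), 2):
        # Two lines cross when their end points are ordered one way in X
        # and the opposite way in Y.
        if (i1 - i2) * (j1 - j2) < 0:
            per_line[a] += 1
            per_line[b] += 1
            total += 1
    return total, max(per_line, default=0)


def gain(lines, wm, ws):
    """Gain of a trace: one Wm per matched symbol, one Ws per crossing."""
    total, _ = trace_stats(lines)
    return len(lines) * wm + total * ws


if __name__ == "__main__":
    X = "BACEAB"
    Y = "ACDBBA"
    # Hypothetical trace (0-based indices); every line joins equal symbols.
    trace = [(0, 3), (1, 0), (2, 1), (4, 5), (5, 4)]
    assert all(X[i] == Y[j] for i, j in trace)
    print(trace_stats(trace))   # (3, 2)
    print(gain(trace, 2, -1))   # 7
```

Because no line in this sample trace is crossed more than twice, it respects the limit of an LACS2 trace.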
Usage Finite Automata
Finding a sequence in the usage tree is costly because the running time of the matching is at least O(|V1||V2|), where V1 and V2 are the node sets of the sequence and the tree, respectively. To speed up the searches, this research applies finite-automaton techniques (Aho, Lam, Sethi, & Ullman, 2006) to usage-pattern matching. A usage finite automaton M is a 5-tuple (Q, q0, A, Σ, δ) where
• Q is a finite set of states,
• q0 ∈ Q is the start state,
• A ⊆ Q is a distinguished set of accepting states,
• Σ is a set of events, and
• δ is a function from Q × Σ into Q, called the transition function of M.
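The 5-tuple maps naturally onto a small data structure. The following is a minimal sketch, assuming events are encoded as single letters. The class name UsageDFA and its field names are hypothetical. For brevity, δ is stored as a partial mapping, and a missing transition is treated as a rejection, which is equivalent to sending the automaton to a non-accepting dead state.

```python
from dataclasses import dataclass


@dataclass
class UsageDFA:
    states: frozenset     # Q
    start: int            # q0
    accepting: frozenset  # A
    events: frozenset     # Sigma
    delta: dict           # (state, event) -> state

    def accepts(self, sequence):
        """Run the automaton over a sequence of events."""
        state = self.start
        for event in sequence:
            key = (state, event)
            if key not in self.delta:
                return False  # undefined move: reject (implicit dead state)
            state = self.delta[key]
        return state in self.accepting


if __name__ == "__main__":
    # Hand-built automaton for the single pattern H -> P -> I -> M.
    dfa = UsageDFA(
        states=frozenset(range(5)),
        start=0,
        accepting=frozenset({4}),
        events=frozenset("HPIM"),
        delta={(0, "H"): 1, (1, "P"): 2, (2, "I"): 3, (3, "M"): 4},
    )
    print(dfa.accepts("HPIM"))  # True
    print(dfa.accepts("HPMI"))  # False
```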
For a usage tree prepared as in Part B of the previous section, a usage DFA (deterministic finite automaton) M can be constructed by following the steps below:
1. Each path starting at the root and ending at a leaf is a regular expression. For example, the regular expression of the path Checking schedule (H) → Making phone calls (P) → Checking IMs (I) → Sending IMs (M) is “HPIM”, where the letters are shorthand for the events in Figure 5.
2. Combine all regular expressions into one regular expression by using the “or” operator ‘|’. For example, the resulting regular expression of the usage tree in Figure 5 is “VPVP|VEL|HPIM|HTBPW”.
3. Convert the regular expression into an NFA (nondeterministic finite automaton).
4. Convert the NFA to a DFA where
ο An edge label is an event such as making phone calls.
ο An accepting state represents a match of a pattern.
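The steps above can be sketched as follows for the special case where every alternative is a plain string of event letters. The NFA has one chain of states per pattern, all leaving a shared start state, and the subset construction then yields the DFA. All function names are hypothetical. The sketch accepts exactly the listed patterns, so it is not claimed to reproduce the automaton of Figure 7.

```python
def patterns_to_nfa(patterns):
    """NFA for p1|p2|...: one chain of states per pattern, all leaving
    the shared start state 0. Returns (transitions, accepting states)."""
    trans = {}          # (state, event) -> set of next states
    accepting = set()
    next_state = 1
    for pattern in patterns:
        state = 0
        for event in pattern:
            trans.setdefault((state, event), set()).add(next_state)
            state = next_state
            next_state += 1
        accepting.add(state)
    return trans, accepting


def nfa_to_dfa(trans, accepting, events):
    """Subset construction: each DFA state is a set of NFA states.
    The DFA start state always receives id 0."""
    start = frozenset({0})
    ids = {start: 0}    # set of NFA states -> DFA state id
    delta = {}          # (DFA state, event) -> DFA state
    dfa_accepting = set()
    work = [start]
    while work:
        current = work.pop()
        if current & accepting:
            dfa_accepting.add(ids[current])
        for event in events:
            target = frozenset(
                t for s in current for t in trans.get((s, event), ())
            )
            if not target:
                continue  # no move on this event; leave it undefined
            if target not in ids:
                ids[target] = len(ids)
                work.append(target)
            delta[(ids[current], event)] = ids[target]
    return delta, dfa_accepting


def accepts(delta, dfa_accepting, sequence):
    """Run the DFA from state 0; an undefined move means rejection."""
    state = 0
    for event in sequence:
        if (state, event) not in delta:
            return False
        state = delta[(state, event)]
    return state in dfa_accepting


if __name__ == "__main__":
    patterns = ["VPVP", "VEL", "HPIM", "HTBPW"]
    events = sorted(set("".join(patterns)))
    delta, acc = nfa_to_dfa(*patterns_to_nfa(patterns), events)
    print(accepts(delta, acc, "HPIM"))  # True
    print(accepts(delta, acc, "VP"))    # False
```

Because patterns that begin with the same event share their first transition, the resulting DFA has the shape of a prefix tree over the patterns.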
For example, the DFA of the usage tree in Figure 5 is given in Figure 7, where the double-circled nodes are the accepting states.
Using a DFA to store usage patterns and search for patterns is effective and convenient, but this approach also suffers from the following shortcomings:
• The DFA may accept more patterns than the usage tree does. For example, the pattern Checking schedule → Making phone calls → Checking voice mails → Checking emails → Sending emails is accepted by the DFA along the path 0 → 1 → 4 → 2 → 5 → 8, where the final state 8 is an accepting state, but the pattern does not exist in the tree. However, this feature may not be harmful because the extra patterns it accepts may be “reasonable” ones. For example, the above pattern is quite legitimate, i.e., users may well operate their devices in the order “checking schedule, making phone calls, checking voice mails, checking emails, and sending emails.”
• This approach misses an important piece of information, the event frequency. Step B, Usage Data Preparation, of this method removes events with frequencies lower than a threshold value. Otherwise, this DFA does not use the frequency information, which could be very useful.
• Pattern discovery is virtually unused in this research because the DFA uses all paths from the usage tree. Without much pattern discovery, the usage tree and DFA may grow too large to be stored on the device.
Figure 6. An LACS2 illustrated through a trace