PATTERN DETECTION - Dunham Data Mining pdf

Given a set of data values (d1, d2, ..., dn), where di is collected at time ti and ti < tj iff i < j, the pattern detection problem is to identify a given pattern that occurs in this sequence. This can be viewed as a type of classification problem where the pattern to be predicted is one found in a given set of patterns. Typical pattern detection applications include speech recognition and signal processing. Spelling correctors and word processors also use simple pattern detection algorithms. Although these simpler cousins of the true data mining pattern detection problems are precise, the more general pattern detection problems are fuzzy, with no exact matches. Approximations are needed. While humans are good at detecting such patterns, machines are not.

9.4.1 String Matching

The string matching problem assumes that both a long text document and a short pattern are given. The problem is to determine where the pattern is found in the text. Example 9.3 illustrates the pattern detection problem when it is applied to string matching. This problem is a common one, with many applications in word processing.
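As a baseline for the algorithms discussed later in this section, the problem statement can be sketched as a brute-force scan. This is only an illustration, not a method from the text: the function name naive_find_all and the sample sentence are assumptions made for the example.

```python
def naive_find_all(text: str, pattern: str) -> list[int]:
    """Return every index at which pattern occurs in text (brute force)."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):
        # Compare the pattern against the window that begins at `start`.
        if text[start:start + m] == pattern:
            matches.append(start)
    return matches


# Hypothetical sample text, invented for this sketch.
sample = "Holder, M. and Smith, J. Contact: Martha Holder"
positions = naive_find_all(sample, "Holder")
print(positions)
```

Each window costs up to m character comparisons, so the scan is O(mn) in the worst case; the FSM-based approach described below avoids re-examining characters of the text.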

EXAMPLE 9.3

Martha Holder is editing her resume using a popular word processor. She has just gotten married and wishes to change the name Holder to her new last name of Laros, where appropriate. Not all occurrences of Holder, however, should be changed. For example, she does not want to change the author names on previous publications that appeared under her maiden name. Using the word processor, she repeatedly finds all occurrences of Holder in the vita. She then must examine the context to determine whether each one should be changed to Laros. In this case, the pattern being matched is (H, o, l, d, e, r). Only words that are an exact match to this pattern should be found. Note that here each letter is viewed as if it occurred at a later point in time. In actuality, it is a later point in the document.

258 Chapter 9 Temporal Mining

One of the earliest string matching algorithms is the Knuth-Morris-Pratt or KMP algorithm. KMP creates a finite state machine (FSM), which is used to recognize the given pattern. The FSM represents all possible states that exist when scanning a string to match the given pattern. Each node in the FSM relates to one of these states. Figure 9.7 shows an FSM created to recognize the pattern "ABAABA." Here there are seven states. State i represents the fact that the first i characters in the pattern match the most recent i characters in the string. State six is designated as the recognizer state with two concentric circles. The arcs in the graph are labeled with the character from the pattern that causes a transition between the two states as indicated. Transitions labeled with "*" indicate that this transition is taken with any other character found in the string. The KMP algorithm creates the FSM for a given pattern. The FSM can then be applied to the string by starting at the first character in the string. From a given state, the next character in the string determines which transition is taken. The accepting state of the FSM is reached only when the pattern is found in the string. The worst-case behavior of the application of the FSM is O(m + n), where m is the length of the pattern and n is the length of the string. The preprocessing phase to create the FSM is O(m) in space and time.
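The FSM idea can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the book's construction: the names build_fsm and fsm_search are hypothetical, each state's transitions are stored in a dictionary, and any character missing from a dictionary plays the role of a "*" arc back to state 0. Copying a dictionary per state makes this preprocessing cost grow with the alphabet size, so it does not achieve the O(m) bound quoted above; it is chosen only because it mirrors the state diagram directly.

```python
def build_fsm(pattern: str) -> list[dict[str, int]]:
    """Build a transition table with len(pattern) + 1 states.

    State i means the last i characters read equal the first i
    characters of the pattern.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    fsm: list[dict[str, int]] = [{} for _ in range(m + 1)]
    fsm[0][pattern[0]] = 1
    fallback = 0  # state the FSM would be in after reading pattern[1:state]
    for state in range(1, m + 1):
        # On a mismatch, behave exactly like the fallback state would.
        fsm[state] = dict(fsm[fallback])
        if state < m:
            # On a match, advance to the next state.
            fsm[state][pattern[state]] = state + 1
            fallback = fsm[fallback].get(pattern[state], 0)
    return fsm


def fsm_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    fsm = build_fsm(pattern)
    accept = len(pattern)
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # Characters with no explicit arc send the FSM back to state 0.
        state = fsm[state].get(ch, 0)
        if state == accept:
            hits.append(pos - accept + 1)
    return hits


table = build_fsm("ABAABA")
print(len(table))                           # 7 states, numbered 0 through 6
print(fsm_search("ABABAABAABA", "ABAABA"))  # [2, 5]
```

Each character of the text is read exactly once, which is where the linear scanning cost comes from. The accepting state also carries outgoing transitions here so that overlapping occurrences are reported.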

FIGURE 9.7: FSM for string "ABAABA."

Another algorithm that builds on the KMP approach is called the Boyer-Moore, or BM, algorithm. The same FSM is constructed to recognize the pattern, but the pattern is applied to the string in a right-to-left manner. For example, when looking for the string "ABAABA," if the sixth character in the string is not A, then we know that the pattern is not found in the string starting at the first character in the string. We also know that if the sixth character is neither an "A" nor a "B," then the pattern does not exist in the string starting at any of the first six characters. The BM needs only one comparison to determine this, while the KMP would have to examine all of the first six characters. Again, the BM is O(m + n) in the worst-case scenario, but the expected and best cases are better than this. The actual performance depends (of course) on both the pattern and the string.
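The benefit of comparing from the right end of the pattern can be illustrated with a simplified relative of BM. The sketch below is the Boyer-Moore-Horspool variant, which uses only a table of shifts keyed on the text character aligned with the last pattern position. It is not the full BM algorithm and does not build the FSM described above, and it does not share the O(m + n) worst-case bound. The function name horspool_find_all is an assumption made for this example.

```python
def horspool_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # For each character in pattern[:-1], the distance from its last
    # occurrence to the end of the pattern. Other characters shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    start = 0
    while start <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(start)
        # Slide the window according to the text character that was
        # aligned with the last position of the pattern.
        start += shift.get(text[start + m - 1], m)
    return hits


# The sixth character is "X", which is neither "A" nor "B", so the first
# six starting positions are ruled out after a single comparison.
print(horspool_find_all("ABAABXABAABA", "ABAABA"))  # [6]
```

When the character under the last pattern position does not occur in the pattern at all, the window jumps by the full pattern length, which is the situation described in the paragraph above.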

Even though KMP and BM are pattern recognition algorithms, they usually are not thought of as data mining applications. The identification of patterns in these earlier techniques is precise. Most data mining pattern matching applications are fuzzy; that is, the pattern being compared to (i.e., the class representative) and the object being classified will not match precisely. However, as we will see, there are more advanced pattern recognition algorithms that are similar in that graphical structures are built to specifically recognize a pattern. In effect, these true data mining applications build on these earlier non-data mining algorithms.

When examining text strings, it often is beneficial to determine the "distance" between one string and another. For example, spelling checkers use this concept to recommend corrections for misspelled words. Again, these usually are not thought of as data mining activities, but the distance measure technique we discuss here is often the basis for more advanced distance measure approaches. Suppose that we wish to convert A = (a1, a2, ..., an) to B = (b1, b2, ..., bm). The basic idea is to determine the minimum cost of the steps that are needed to convert one string to the other. There are three operations that can be performed to convert string A to string B. Starting at the first character in each string, each step identifies which operation should be performed on A and B to change A to B. Each operation not only indicates specific functions to be performed but also has an associated cost. The following descriptions assume that we are currently examining ai in A and bj in B:

• Match: Leave ai and bj as they are. The next character in A is ai+1 and in B is bj+1. The cost of this operation is 0 if ai = bj; otherwise the cost is ∞.

• Delete: Drop ai from A. The new length of A is n - 1. The cost of this operation is 1.

• Insert: Insert bj into A at position ai. All characters in A following the previous ai are shifted down one, and the new length of A is n + 1. The next character in A is ai+1 and in B is bj+1. The cost of this operation is 1.

The distance between strings A and B is then determined by the minimum total cost for all operations needed to convert A to B. For example, the distance from catch to cat is 2 because the c and h have to be deleted. Similarly, the distance from cat to hat is 2 because c must be deleted and h must be inserted. Example 9.4 illustrates the process.
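The minimum total cost can be computed with a small dynamic programming table. The sketch below is one possible way to do it, assuming the three operations and costs listed above (match at cost 0 only when the characters are equal, delete and insert at cost 1 each, and no substitution operation). The function name string_distance and the table layout are choices made for this illustration.

```python
def string_distance(a: str, b: str) -> int:
    """Minimum number of inserts and deletes needed to convert a into b."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimum cost of converting a[:i] into b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every character of a[:i]
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every character of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j] + 1,   # delete a[i-1]
                       cost[i][j - 1] + 1)   # insert b[j-1]
            if a[i - 1] == b[j - 1]:
                # A match costs nothing; a mismatch has infinite cost
                # and is therefore never chosen.
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = best
    return cost[n][m]


print(string_distance("catch", "cat"))  # 2
print(string_distance("cat", "hat"))    # 2
```

Because a mismatch cannot be "matched," replacing one character by another always costs 2 here (one delete plus one insert), which is why the distance from cat to hat is 2 rather than 1.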

EXAMPLE 9.4

Suppose that we wish to determine the distance between a string "apron" and "crayon." By looking at the strings, we see that we can match at most three characters: either a, o, n or r, o, n. Figure 9.8 illustrates the use of the first matching. Here the cost is 5 because we have to insert c, r, y and delete p, r. The figure shows that we can view the problem as a shortest path between two points: the top left corner and the bottom right corner.
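The shortest-path view can be made concrete with a small search over grid points. In the sketch below, node (i, j) means that i characters of the source string and j characters of the target string have been handled; delete and insert edges cost 1 and a match edge costs 0. The function name shortest_conversion and the use of Dijkstra's algorithm are assumptions of this illustration, not details taken from the figure.

```python
import heapq


def shortest_conversion(a: str, b: str) -> tuple[int, list[str]]:
    """Return the minimum cost and one cheapest list of operations."""
    n, m = len(a), len(b)
    start, goal = (0, 0), (n, m)
    best = {start: 0}   # cheapest known cost to reach each grid point
    prev = {}           # grid point -> (predecessor, operation label)
    heap = [(0, start)]
    while heap:
        cost, (i, j) = heapq.heappop(heap)
        if (i, j) == goal:
            break
        if cost > best[(i, j)]:
            continue  # stale queue entry
        steps = []
        if i < n:
            steps.append((1, (i + 1, j), f"delete {a[i]}"))
        if j < m:
            steps.append((1, (i, j + 1), f"insert {b[j]}"))
        if i < n and j < m and a[i] == b[j]:
            steps.append((0, (i + 1, j + 1), f"match {a[i]}"))
        for weight, node, label in steps:
            new_cost = cost + weight
            if new_cost < best.get(node, float("inf")):
                best[node] = new_cost
                prev[node] = ((i, j), label)
                heapq.heappush(heap, (new_cost, node))
    # Walk back from the goal to recover the operations in order.
    ops = []
    node = goal
    while node != start:
        node, label = prev[node]
        ops.append(label)
    ops.reverse()
    return best[goal], ops


total, operations = shortest_conversion("apron", "crayon")
print(total)  # 5
print(operations)
```

More than one cheapest path exists for this pair (matching a, o, n or r, o, n), so the particular operation list printed may correspond to either matching; the total cost is 5 in both cases.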

FIGURE 9.8: Convert apron to crayon. (Grid diagram with the letters of apron and crayon along its edges; the legend distinguishes Insert, Match, and Delete steps.)
