
To illustrate the algorithm, the following example is given.

Consider the two sequences:

O_i = {iacbdegzfad}

O_j = {bdciaegzbc}

After removing all unmatched elements (f in this case), we have two new sequences:

O'_i = {iacbdegzad}

O'_j = {bdciaegzbc}

Nondominated candidate first matches include i, b, and c. Note that d is dominated by b. The complete common sequence tree generated is shown in Figure 4.1.

[Figure: common sequence tree rooted at the null node, with first-level branches i, b, and c; the longest branches end in e - g - z.]

Figure 4.1: Common Sequence Tree
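The choice of first matches can be checked with a short sketch. It assumes a dominance rule that is not spelled out in this section: a match is dominated when another match occurs earlier in both sequences. The function name and the 0-based positions are illustrative only.

```python
def nondominated_first_matches(x, y):
    """Return (symbol, pos_in_x, pos_in_y) for each first match that no other match dominates."""
    # Earliest occurrence of every common symbol in each sequence.
    firsts = [(s, x.index(s), y.index(s)) for s in dict.fromkeys(x) if s in y]
    # Assumed rule: (p, q) is dominated when another match is earlier in both sequences.
    return [
        (s, p, q)
        for (s, p, q) in firsts
        if not any(p2 < p and q2 < q for (_, p2, q2) in firsts)
    ]


matches = nondominated_first_matches("iacbdegzad", "bdciaegzbc")
print([s for s, _, _ in matches])
```

Under this rule the surviving symbols for the reduced sequences are i, c, and b, while d is removed because b precedes it in both sequences.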

Theorem 4.1 Algorithm LCS finds a Longest Common Subsequence for two original sequences.

Proof: The result follows from the fact that the algorithm considers all possible common subsequences except those containing dominated matchings. As noted before, there must exist an LCS without a dominated matching: given a common subsequence that contains one, a new common subsequence of equal or greater length is obtained by replacing the dominated matching with the matching that dominates it. In making this switch, no subsequent matchings are lost, since the remaining partial sequences are at least as large for each original sequence.
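The claim can be spot-checked on the example. The sketch below is one possible reading of the tree search, not the author's implementation: it branches only on nondominated first matches of the remaining partial sequences (using the assumed dominance rule that a match is dominated when another is earlier in both sequences) and compares the result with the length given by the standard dynamic-programming recurrence. All function names are hypothetical.

```python
from functools import lru_cache


def lcs_by_nondominated_matches(x, y):
    """Search the common sequence tree, branching only on nondominated first matches."""

    @lru_cache(maxsize=None)
    def best(i, j):
        xs, ys = x[i:], y[j:]
        firsts = [(xs.index(s), ys.index(s)) for s in dict.fromkeys(xs) if s in ys]
        candidates = [
            (p, q) for (p, q) in firsts
            if not any(p2 < p and q2 < q for (p2, q2) in firsts)
        ]
        result = ""
        for p, q in candidates:
            tail = best(i + p + 1, j + q + 1)  # continue after the chosen match
            if 1 + len(tail) > len(result):
                result = xs[p] + tail
        return result

    return best(0, 0)


def lcs_length_dp(x, y):
    """Reference LCS length from the textbook dynamic-programming recurrence."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for k, b in enumerate(y, start=1):
            cur.append(prev[k - 1] + 1 if a == b else max(prev[k], cur[k - 1]))
        prev = cur
    return prev[-1]


x, y = "iacbdegzad", "bdciaegzbc"
found = lcs_by_nondominated_matches(x, y)
print(found, len(found), lcs_length_dp(x, y))
```

For the reduced sequences both routines agree on a length of 5, which is consistent with the theorem; agreement on one example is of course not a proof.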

Several additional observations regarding matching elements are useful for visualizing the problem. If we connect each element in O'_i and O'_j with its matching(s) in the other sequence, we obtain the set of crossed pairs of elements shown in Figure 4.2.

We note the following:

1. All elements in O'_i (O'_j) are connected to at least one element in O'_j (O'_i).

2. If two edges cross, then the two corresponding pairs of elements are in different order in O'_i and O'_j; therefore they cannot be in the same common sequence.

3. The elements of any set of non-crossed pairs/edges constitute a common sequence of O_i and O_j.

4. A Longest Common Subsequence (LCS) corresponds to a largest set of non-crossed edges/pairs.

[Figure: the elements of O'_i and O'_j drawn in two rows, with edges joining each element to its matching(s) in the other row.]

Figure 4.2: LCS corresponds to a largest set of non-crossed edges
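The edge view of observations 2 to 4 can be sketched as follows. Edges are represented here as hypothetical (position in O'_i, position in O'_j) pairs with 0-based positions, and the check additionally assumes that no two edges of a set share an endpoint.

```python
def crosses(e1, e2):
    """Two edges cross when their endpoints are in opposite order in the two sequences."""
    (p1, q1), (p2, q2) = e1, e2
    return (p1 - p2) * (q1 - q2) < 0


def is_noncrossed(edges):
    """True when, sorted by the first position, the second positions also strictly increase."""
    ordered = sorted(edges)
    return all(p1 < p2 and q1 < q2 for (p1, q1), (p2, q2) in zip(ordered, ordered[1:]))


def spelled_sequence(x, y, edges):
    """Read off the common sequence described by a set of matching edges."""
    assert all(x[p] == y[q] for p, q in edges), "every edge must join equal elements"
    return "".join(x[p] for p, _ in sorted(edges))


x, y = "iacbdegzad", "bdciaegzbc"
edges = [(0, 3), (1, 4), (5, 5), (6, 6), (7, 7)]  # edges for i, a, e, g, z
print(is_noncrossed(edges), spelled_sequence(x, y, edges))
print(crosses((0, 3), (3, 0)))  # the edge for i crosses the edge for b
```

The five edges form a non-crossed set and spell the common sequence iaegz, whereas the edges for i and b cross and so cannot appear together.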

4.1.3 Shortest Composite Supersequence (SCS) Algorithm

The LCS for two sequences can be used to construct the SCS of those sequences.

For the original sequences O_i and O_j, assume without loss of generality that |O_i| ≤ |O_j|.

The following algorithm, entitled FindSCS, is used to construct a composite operation supersequence for parts i and j:

1. Find LCS_ij for parts i and j.

2. Initialize the partial composite supersequence P_c to be the longest operation sequence (P_c = O_j).

3. Prepend all characters from the short word (O_i) that appear before the start of the LCS_ij to the front of P_c.

4. Append all characters that appear after the LCS_ij in the short word to the end of P_c.

5. Place the characters in the short word that are not part of the LCS_ij but fall between LCS_ij characters in the same position in P_c, that is, between the corresponding LCS_ij characters.
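The five steps can be sketched in code as below. This is a minimal illustration under stated assumptions rather than the author's implementation: the LCS is obtained with the standard dynamic-programming table instead of the tree search, ties between equally long LCSs are broken arbitrarily, and the names `lcs_pairs`, `find_scs`, and `is_subsequence` are hypothetical.

```python
def lcs_pairs(x, y):
    """Matched index pairs (p, q) of one LCS of x and y (standard DP, assumed here)."""
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for p in range(n - 1, -1, -1):
        for q in range(m - 1, -1, -1):
            if x[p] == y[q]:
                table[p][q] = 1 + table[p + 1][q + 1]
            else:
                table[p][q] = max(table[p + 1][q], table[p][q + 1])
    pairs, p, q = [], 0, 0
    while p < n and q < m:
        if x[p] == y[q]:
            pairs.append((p, q))
            p, q = p + 1, q + 1
        elif table[p + 1][q] >= table[p][q + 1]:
            p += 1
        else:
            q += 1
    return pairs


def find_scs(short, long_):
    """Merge the short word into the long one around their LCS (steps 1 to 5)."""
    pairs = lcs_pairs(short, long_)  # step 1
    out, prev_p, prev_q = [], 0, 0
    for p, q in pairs:
        # Steps 3 and 5: unmatched characters of the short word are placed before
        # the next LCS character; the long word's own characters (step 2) follow.
        out += [short[prev_p:p], long_[prev_q:q], short[p]]
        prev_p, prev_q = p + 1, q + 1
    # Step 4: the rest of the long word, then the short word's trailing characters.
    out += [long_[prev_q:], short[prev_p:]]
    return "".join(out)


def is_subsequence(s, t):
    """True when s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


scs = find_scs("iacbdegzad", "bdciaegzbc")
print(scs, len(scs))
```

For the two reduced sequences of the earlier example, the result contains both inputs as subsequences and has 15 characters, that is, 10 + 10 minus an LCS length of 5.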

Theorem 4.2 Algorithm FindSCS constructs a shortest combined supersequence for any two sequences O_i and O_j.

Proof: Any composite supersequence has a number of characters (operations) at least equal to the sum of the numbers of characters in the two separate sequences minus the number of characters used for both original sequences. (Note that a character can be used at most once for any sequence.) Thus, a lower bound on the length of a combined sequence is given by the sum of the sequence lengths minus the length of the longest common subsequence (LCS). This follows since, if any larger set of characters could be used by both sequences, it would constitute a longer common subsequence, contradicting the definition of an LCS. Finally, by construction, the composite supersequence produced by Algorithm FindSCS always has a length equal to this lower bound.

It remains only to show that the algorithm does construct a supersequence of both O_i and O_j. This follows from step 2 for O_j and from steps 3, 4, and 5 for O_i.
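The counting argument can be summarized in a single identity. As an illustration that is not part of the source, the second line applies it to the reduced sequences of the earlier example, assuming their LCS has length 5 (for instance i a e g z).

```latex
\[
  |SCS_{ij}| \;=\; |O_i| + |O_j| - |LCS_{ij}|
\]
\[
  |O'_i| + |O'_j| - |LCS| \;=\; 10 + 10 - 5 \;=\; 15
\]
```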

Finally, we need to establish that a composite sequence for a part family can be obtained by iteratively adding one new part type at a time to the SCS found.

Theorem 4.3 The Shortest Composite Supersequence (SCS) obtained by Algorithm FindSCS applied to one word and the SCS from a set of words produces a composite sequence for the set containing the original set plus the new word.

Proof: Using induction, the proof follows from Theorem 4.2 by noting that a shortest composite sequence is a composite sequence and that no subsequence is eliminated by adding additional operations.

Unfortunately, the sequential procedure does not guarantee a shortest composite supersequence for the entire family. To see this, consider the sequences O_i = {acb} and O_j = {cab}. A shortest composite supersequence is {acab}. If we then combine this with the sequence {cacb}, we obtain the composite supersequence {cacab}. However, {cacb} is the SCS for the set of three original sequences. Even if all the SCSs (if not unique) are identified, it remains possible that the composite sequence of a new word and a CS (a composite sequence with |CS| > |SCS|) would be shorter than that of the new word and an SCS. For example, given O_i = {acd} and O_j = {abd}, an SCS_ij will be {abcd}, and a CS_ij could be {acdbd}. If the new sequence {cdbd} is added, then a composite sequence for the CS and {cdbd} is {acdbd}. However, combining the SCS {abcd} and {cdbd} yields {abcdbd}, which is longer than {acdbd}.
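The lengths quoted in these counterexamples can be verified by exhaustive search on such short words. The sketch below takes the third sequence of the first example to be cacb and the pair of the second example to be acd and abd (readings assumed where the scanned text is ambiguous). The breadth-first search and the name `scs_length` are illustrative and are not part of Algorithm FindSCS.

```python
from collections import deque


def scs_length(words):
    """Length of a shortest common supersequence of all words, by breadth-first search."""
    start = tuple(0 for _ in words)
    goal = tuple(len(w) for w in words)
    alphabet = sorted(set("".join(words)))
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for ch in alphabet:
            # Emitting ch advances every word whose next unread character is ch.
            nxt = tuple(
                k + 1 if k < len(w) and w[k] == ch else k
                for k, w in zip(state, words)
            )
            if nxt != state and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))


# First example: sequential merging gives 5, the joint optimum is 4.
print(scs_length(["acb", "cab"]), scs_length(["acab", "cacb"]), scs_length(["acb", "cab", "cacb"]))
# Second example: merging with the SCS gives 6, merging with the longer CS gives 5.
print(scs_length(["acd", "abd"]), scs_length(["abcd", "cdbd"]), scs_length(["acdbd", "cdbd"]))
```

Under these readings the search reports 4, 5, 4 for the first example and 4, 6, 5 for the second, matching the lengths of {acab}, {cacab}, {cacb} and of {abcd}, {abcdbd}, {acdbd}.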

Returning to our original objective, based on the above definitions, we define the similarity coefficient as:

{ ILGSiil