Dynamic Programming Application of Problem with Optimal Subsequence

Tropical Journal of Applied Natural Sciences

Trop. J. Appl. Nat. Sci., 2(3): 33-38 (2019) ISSN: 2449-2043

https://doi.org/10.25240/TJANS.2019.2.3.04 Available online: https://tjansonline.com

M. Laisin

Department of Mathematics

Chukwuemeka Odumegwu Ojukwu University, Uli Campus Anambra State, Nigeria

*Corresponding author’s email: [email protected]

ABSTRACT

In this paper, we discuss and compare various implementations of the Longest Common Subsequence (LCS) algorithm in terms of both complexity and practical performance. In this investigation, we also look at existing space- and time-optimizing algorithms. Furthermore, we construct formulae for the sequences 𝑋 = (𝑥₁, 𝑥₂, . . . , 𝑥_𝑚) and 𝑌 = (𝑦₁, 𝑦₂, . . . , 𝑦_𝑛) such that the LCS of X and Y is 𝑍 = (𝑧₁, 𝑧₂, . . . , 𝑧_𝑘) for a line with no two selected having unit separation, k-separable inclusion and k-separable non-inclusion.

Keywords: Arrangement, dynamic programming, k-separable inclusion, k-separable non-inclusion.

Original Research Article

Received: 21st Apr., 2018.

Accepted: 10th Oct., 2018.

Published: 11th Feb., 2019.

1. Introduction

Dynamic programming (DP) offers a unified approach to solving multi-stage optimal control problems (Bellman, 1957; Bertsekas, 2000). The process control community has largely disregarded DP because of its unwieldy computational complexity, referred to as the 'curse of dimensionality.' As a result, the community has studied DP only at a conceptual level (Morari and Lee, 1999), and there have been few applications. In contrast, the artificial intelligence (AI) and machine learning (ML) fields have over the past two decades made significant strides towards using DP for practical problems, as evidenced by review papers (Kaelbling et al., 1996) and textbooks (Bertsekas and Tsitsiklis, 1996). Though a myriad of approximate solution strategies have been suggested in the context of DP, they are mostly tailored to the characteristics of applications in Operations Research (OR), Computer Science, and robotics. Since the characteristics and requirements of these applications differ considerably from those of process control problems, these approximate methods should be understood and interpreted carefully from the viewpoint of process control before they are considered for real process control problems.

However, much of the recent research on dynamic programming appears to deal with methods devised to overcome the limitations of discrete dynamic programming, and researchers have proposed several useful methods over the years. These methods, known as "successive approximation methods," include differential DP (for example, Jacobson and Mayne (1970) for unconstrained optimal control problems, and Murray and Yakowitz (1979) for problems with linear constraints), discrete differential DP (Heidari et al., 1971), state incremental DP (Larson, 1968), nonlinear programming algorithms (Lee and Waziruddin, 1970; Gagnon et al., 1974; Chu and Yeh, 1978), and a discrete maximum principle algorithm (Papageorgiou, 1985). Some of these methods avoid discretization of the state space.

2. Longest Common Subsequence (LCS)

Based on dynamic programming constructs, LCS compares two strings of data character by character in a recursive fashion, and can return either the length of the subsequence common to both inputs or the actual subsequence itself.

Thus, the longest common subsequence, as a computer science problem, is concerned with the search for an efficient method of finding the longest common subsequence of two or more sequences. A milestone in the study of LCS is the development of dynamic programming algorithms (Hirschberg, 1977; Bergroth et al., 2000), which calculate the LCS in polynomial time. In addition, Bergroth, Hakonen, and Raita (2000) discussed the Longest Common Subsequence problem in a workshop as an algorithm that can be applied to many different problems. It is useful in deciding whether two strings of DNA are homologous and in comparing files, as in the Linux diff utility. The real-world advantages in both cases are obvious.

Combined with the use of regular expressions, LCS would allow us to detect cheating or to advance the field of comparative genomics (both are noble causes). There is great interest in studying the applications of LCS algorithms because, at the most fundamental level, they tie Biology and Computer Science together. LCS is a simple algorithm and yet has a variety of practical applications. Years of dedicated effort by mathematicians and computer scientists have resulted in the development of two tools, the Global Alignment System (GLASS) and Rosetta. Batzoglou and Pachter (2007) developed software implementations that allow for aligning pairs of homologous sequences and for predicting exons on a target human sequence, based on homologous comparisons from a different species. Both are real-world applications of Longest Common Subsequence algorithms. The dynamic programming algorithm for calculating LCS was originally designed to work for one-dimensional sequences of discrete values, but Vlachos et al. (2003) extended it to work for multidimensional, real-valued sequences.

2.2 Complexity of the Naive Solution

To begin the complexity analysis, it is prudent to start with the fundamentals. When using Longest Common Subsequence algorithms, we are dealing with inputs that are textual and one-dimensional in nature. These inputs are composed of characters that belong to a certain alphabet. The problem can be viewed in two ways. In the first, strings of characters separated by whitespace are treated as unitary symbols; this, though, leads to an infinite alphabet of symbols. In the second, and more intuitive, view each character is a symbol in our alphabet, resulting in a sequence that is a string over a finite alphabet. We aim to find patterns in the input sequences to an LCS problem. Since we intend to analyse genetic code, we expect to find syntactical structure in our strings, and it is best to treat patterns as subsequences (as opposed to substrings). Nevertheless, how many such patterns exist in a sequence of length n? Let us first look at substrings. For any length k with 1 ≤ k ≤ n there are n − k + 1 substrings in the sequence, so that in total

\[
\sum_{k=1}^{n} (n - k + 1) = \sum_{k=1}^{n} k = \frac{(n + 1)\,n}{2} = \frac{n^{2} + n}{2} \tag{1}
\]

This shows that the number of substrings is of the order 𝑛². For subsequences, on the other hand, the result is significantly worse. Each symbol in the sequence is either in the subsequence or not in it, which is a binary choice. Therefore, there are a total of 2^𝑛 subsequences. On the upside, there is only one maximal recurring subsequence, indicating that the problem can be restricted to improve our odds of completing execution in a reasonable amount of time (Stephen, 1992). If we keep things the way they are, we end up with a time complexity of approximately 𝑂(2^𝑛) when m ≈ n. Space complexity is just as bad, since everything is stored until the algorithm has finished execution.
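The naive approach described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `is_subsequence`, `brute_force_lcs` and `count_substrings` are hypothetical, and the sequences are taken to be Python strings. The search enumerates candidate subsequences of x from longest to shortest, so in the worst case it inspects all 2^n of them.

```python
from itertools import combinations


def is_subsequence(candidate, sequence):
    """Return True if `candidate` appears in `sequence` left to right (gaps allowed)."""
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator, so order is respected.
    return all(symbol in remaining for symbol in candidate)


def brute_force_lcs(x, y):
    """Try subsequences of x from longest to shortest; up to 2**len(x) candidates."""
    for length in range(len(x), -1, -1):
        for positions in combinations(range(len(x)), length):
            candidate = "".join(x[p] for p in positions)
            if is_subsequence(candidate, y):
                return candidate
    return ""


def count_substrings(n):
    """Number of non-empty substrings (counted by position) of a length-n sequence."""
    return sum(n - k + 1 for k in range(1, n + 1))
```

Even for inputs of a few dozen symbols the enumeration becomes impractical, which is the motivation for the dynamic programming formulations discussed next.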

2.3 Advantages of Memoization

Rizvi and Agarwal (2006) discussed a new bucket-based algorithm for finding the LCS of two given molecular sequences. Memoization, by contrast, is a top-down programming method that stores the results of calculations that are often repeated in order to speed up the execution of an algorithm. In this variant of dynamic programming, one maintains the style of the recursive solution but inserts a table look-up before any calculation. If a solution has already been computed, the stored value is used instead of unnecessarily repeating the computation, making the program more efficient. Introducing memoization to the LCS algorithm results in a significant speed-up despite the additional checks being done to see whether a solution already exists. Assuming constant time for calling the function on a subproblem, each time we extend the common subsequence we deal with at most two new subproblems. Since the lengths of the sequences X and Y are m and n respectively, there are at most m × n distinct subproblems, and the complexity is reduced to 𝑂(𝑚 × 𝑛).

Theoretically, recursive function calls such as those found in the recursive and the memoized solutions offer advantages since they do not store results in local variables. In practice, though, this advantage is less visible because recursive functions are more memory intensive during execution. C++, the language of choice for this project, arranges memory space for functions in a stack. As soon as execution of a function is completed, the relevant function data is removed from the stack in a LIFO fashion. It is precisely because of these stack manipulations that recursive implementations run slower and use more memory than iterative implementations, although they often make the code more readable (see http://www.doc.ic.ac.uk/wjk/C++Intro/RobMillerL8.html).

The Longest Common Subsequence problem as stated above can be implemented recursively at the cost of speed and memory usage. Nonetheless, the recursive form remains an important step in understanding the fundamental structure of this algorithm. In essence, everything boils down to the existence of optimal substructure in the LCS, expressed by the recurrence relation already presented in the previous section. Cormen, Leiserson, Rivest, and Stein (2001) gave the main result standing behind all Longest Common Subsequence algorithms, together with an explicit proof of optimal substructure.

2.4 Optimal Substructure of an LCS

Let 𝑋 = (𝑥₁, 𝑥₂, . . . , 𝑥_𝑚) and 𝑌 = (𝑦₁, 𝑦₂, . . . , 𝑦_𝑛) be sequences, and let 𝑍 = (𝑧₁, 𝑧₂, . . . , 𝑧_𝑘) be any LCS of X and Y. Here 𝑋_𝑖 denotes the prefix (𝑥₁, . . . , 𝑥_𝑖) of X, and similarly for Y and Z.

• If 𝑥_𝑚 = 𝑦_𝑛, then 𝑥_𝑚 = 𝑦_𝑛 = 𝑧_𝑘 and 𝑍_{𝑘−1} is an LCS of 𝑋_{𝑚−1} and 𝑌_{𝑛−1}.

• If 𝑥_𝑚 ≠ 𝑦_𝑛, then 𝑧_𝑘 ≠ 𝑥_𝑚 implies that Z is an LCS of 𝑋_{𝑚−1} and Y.

• If 𝑥_𝑚 ≠ 𝑦_𝑛, then 𝑧_𝑘 ≠ 𝑦_𝑛 implies that Z is an LCS of X and 𝑌_{𝑛−1}.

Proof: see Cormen, Leiserson, Rivest, and Stein (2001).
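The three cases of the optimal substructure result can be combined into a single recurrence for the length of an LCS. In the sketch below, c[i, j] is notation introduced here for illustration (it is not used in the source) and stands for the length of an LCS of the prefixes X_i and Y_j:

```latex
\[
c[i, j] =
\begin{cases}
0, & \text{if } i = 0 \text{ or } j = 0, \\
c[i-1,\, j-1] + 1, & \text{if } i, j > 0 \text{ and } x_i = y_j, \\
\max\bigl(c[i-1,\, j],\; c[i,\, j-1]\bigr), & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
\]
```

The first line is the basis (an empty prefix has an empty LCS), the second corresponds to the matching case, and the third to the two non-matching cases taken together.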

2.5.1 Basic Idea (version 1):

What we want to do is take our problem and somehow break it down into a reasonable number of subproblems (where "reasonable" might be something like 𝑛²) in such a way that we can use optimal solutions to the smaller subproblems to give us optimal solutions to the larger ones. Unlike divide-and-conquer (as in mergesort or quicksort), it is OK if our subproblems overlap, so long as there are not too many of them.

2.5.2 Basic Idea (version 2)

Suppose you have a recursive algorithm for some problem that gives you a bad recurrence like 𝑇(𝑛) = 2𝑇(𝑛 − 1) + 𝑛.

However, suppose that many of the subproblems you reach as you go down the recursion tree are the same. Then you can hope for big savings if you store your computations so that you compute each distinct subproblem only once. You can store these solutions in an array or hash table. This view of dynamic programming is often called memoizing.

2.5.3 Basis:

If either sequence is empty, then the longest common subsequence is empty. Therefore, 𝐿𝐶𝑆(𝑖, 0) = 𝐿𝐶𝑆(0, 𝑗) = 0.

2.6 Bellman’s Principle of Optimality

Suppose that the functions 𝑓₁(𝑥₁), 𝑓₂(𝑥₂), . . ., 𝑓_𝑛(𝑥_𝑛), are such that the objective function is as follows:

𝐹(𝑥₁, 𝑥₂ , . . ., 𝑥_𝑛) = 𝑓₁(𝑥₁) + 𝑓₂(𝑥₂)+. . . + 𝑓_𝑛(𝑥_𝑛) (2)

The problem reduces to finding the values of the variables 𝑥₁, 𝑥₂, . . ., 𝑥_𝑛 for which the objective function attains its maximum value subject to the following constraint:

𝑎₁𝑥₁ + 𝑎₂𝑥₂ + . . . + 𝑎_𝑛𝑥_𝑛 ≤ 𝑏_𝑛, where 𝑎_𝑗 ≥ 0, 𝑥_𝑗 ≥ 0, 𝑏_𝑛 ≥ 0, 𝑗 = 1, . . . , 𝑛.

For arbitrary 𝑘 = 1, . . . , 𝑛 we introduce a set of functions {𝐹_𝑘(𝑏_𝑘)} in the following way:

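\[
F_1(b_1) = \max_{a_1 x_1 \le b_1} \{ f_1(x_1) \} = \max_{x_1 \le b_1 / a_1} \{ f_1(x_1) \} \tag{3}
\]

\[
\begin{aligned}
F_2(b_2) &= \max_{a_1 x_1 + a_2 x_2 \le b_2} \{ f_1(x_1) + f_2(x_2) \} \\
&= \max_{a_2 x_2 \le b_2} \Bigl\{ \max_{a_1 x_1 \le b_2 - a_2 x_2} [ f_1(x_1) + f_2(x_2) ] \Bigr\} \\
&= \max_{a_2 x_2 \le b_2} \Bigl\{ f_2(x_2) + \max_{a_1 x_1 \le b_2 - a_2 x_2} [ f_1(x_1) ] \Bigr\} \\
&= \max_{a_2 x_2 \le b_2} \{ f_2(x_2) + F_1(b_2 - a_2 x_2) \}.
\end{aligned}
\]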

In the general case we get the following expression:

\[
F_k(b_k) = \max_{x_k \le b_k / a_k} \{ f_k(x_k) + F_{k-1}(b_k - a_k x_k) \}, \quad k = 2, \ldots, n, \tag{4}
\]

where 𝐹₀(𝑏₀) = 0. Equations (3) and (4) are known as the Bellman equation (see Bellman, 1959).
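The recursion in equation (4) can be sketched in a few lines. The sketch makes assumptions that are not in the source: the variables x_k are restricted to non-negative integers and the coefficients a_k to positive integers, so that each maximisation is a finite search. The name `bellman_max` and the use of `functools.lru_cache` to store the values F_k(b) are illustrative choices.

```python
from functools import lru_cache


def bellman_max(f, a, b):
    """Maximise sum f[k](x_k) subject to sum a[k] * x_k <= b.

    Assumption (not in the source): integer x_k >= 0 and positive integer a[k].
    Returns (maximum value, tuple of maximising x_1 .. x_n).
    """

    @lru_cache(maxsize=None)
    def F(k, budget):
        if k == 0:
            return 0, ()  # F_0(b_0) = 0: no stages left
        best_value, best_plan = None, None
        # x_k ranges over 0 .. floor(budget / a_k), as in equation (4).
        for x in range(budget // a[k - 1] + 1):
            prev_value, prev_plan = F(k - 1, budget - a[k - 1] * x)
            value = f[k - 1](x) + prev_value
            if best_value is None or value > best_value:
                best_value, best_plan = value, prev_plan + (x,)
        return best_value, best_plan

    return F(len(f), b)
```

For instance, with the hypothetical stage functions f₁(x) = 3x, f₂(x) = 5x − x², f₃(x) = 2x, coefficients a = (2, 1, 3) and b = 5, the call `bellman_max(f, a, 5)` evaluates each F_k(b) once and reuses it, which is the point of the principle of optimality.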

3. MAIN RESULTS

Theorem 3. (Restricted Combinatorics)

Let 𝛽(𝑛,𝑚) denote the number of ways of selecting m elements from n elements arranged in a line with no two selected having unit separation, such that 𝑋 = (𝑥₁, 𝑥₂, . . . , 𝑥_𝑚) and 𝑌 = (𝑦₁, 𝑦₂, . . . , 𝑦_𝑛) are sequences and 𝑍 = (𝑧₁, 𝑧₂, . . . , 𝑧_𝑘) is any LCS of X and Y. Then

\[
\beta_{(n,m)} = \binom{\Psi - r + 1}{r}, \quad \text{if } 0 \le r \le \Psi, \tag{5}
\]

where Ψ is either 𝑚 or 𝑛.
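The counting formula (5) can be checked by brute force for small cases. The sketch below reads Ψ as the number of elements in the line and r as the number selected; this reading is an assumption, since the source does not define r explicitly. The names `count_non_adjacent` and `beta` are hypothetical.

```python
from itertools import combinations
from math import comb


def count_non_adjacent(n, r):
    """Count r-subsets of positions 1..n on a line with no two chosen positions adjacent."""
    return sum(
        1
        for chosen in combinations(range(1, n + 1), r)
        if all(b - a > 1 for a, b in zip(chosen, chosen[1:]))
    )


def beta(n, r):
    """Closed form C(n - r + 1, r) for 0 <= r <= n (Psi read as n, an assumption)."""
    return comb(n - r + 1, r) if 0 <= r <= n else 0
```

Comparing the two functions over a small grid of (n, r) values is a quick sanity check of the binomial expression under this reading.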

Proof

This is the main theorem standing behind all Longest Common Subsequence algorithms generated by the arrangements of X and Y. We give a brief outline here for the sake of completeness by considering the following three cases. If

a. 𝑥_𝑚 = 𝑦_𝑛, then 𝑥_𝑚 = 𝑦_𝑛 = 𝑧_𝑘 and 𝑍_{𝑘−1} is an LCS of 𝑋_{𝑚−1} and 𝑌_{𝑛−1};

b. 𝑥_𝑚 ≠ 𝑦_𝑛, then 𝑧_𝑘 ≠ 𝑥_𝑚 implies that Z is an LCS of 𝑋_{𝑚−1} and Y;

c. 𝑥_𝑚 ≠ 𝑦_𝑛, then 𝑧_𝑘 ≠ 𝑦_𝑛 implies that Z is an LCS of X and 𝑌_{𝑛−1}.

Thus, the above result shows that in the recursive solution we have to deal with either one or two subproblems, depending on which condition holds. In essence, every LCS of two sequences contains an LCS of the prefixes of these sequences:

1. In the first case (a), if 𝑥_𝑚 = 𝑦_𝑛, then we have a match, and we continue by trying to find the LCS of the prefix of X and the prefix of Y, as in the program below:

    LCS(Y, n, X, m) {
        if (n == 0 || m == 0) return 0;
        if (Y[n] == X[m]) result = 1 + LCS(Y, n-1, X, m-1);   // no harm in matching up
        else result = max(LCS(Y, n-1, X, m), LCS(Y, n, X, m-1));
        return result;
    }

This algorithm runs in exponential time. In fact, if Y and X share no characters (so that we never have 𝑌[𝑛] = 𝑋[𝑚]), then the number of times that LCS(Y,1,X,1) is recursively called equals

\[
\binom{n + m - 2}{m - 1}.
\]

2. In the second case, (b) and (c), if 𝑥_𝑚 ≠ 𝑦_𝑛, then we have not found a match. So we must look at

i. the prefix of X, and Y, and

ii. X, and the prefix of Y,

and return the longer result, which will be an LCS of both X and Y.

In the memoized version, we store results in a matrix so that any given set of arguments to LCS only produces new work (new recursive calls) once. The memoized version begins by initializing 𝑎𝑟𝑟[𝑖][𝑗] to unknown for all 𝑖, 𝑗, and then proceeds as in the program below.

    LCS(Y, n, X, m) {
        if (n == 0 || m == 0) return 0;
        if (arr[n][m] != unknown) return arr[n][m];            // <- added this line (6)
        if (Y[n] == X[m]) result = 1 + LCS(Y, n-1, X, m-1);
        else result = max(LCS(Y, n-1, X, m), LCS(Y, n, X, m-1));
        arr[n][m] = result;                                    // <- and this line (7)
        return result;
    }

All we have done is save our work in line (7) and make sure that we only embark on new recursive calls if we have not already computed the answer, which is checked in line (6).
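The two programs can be written out in runnable form to see the difference in work. This is a sketch under stated assumptions: strings are 0-indexed Python strings, so the paper's Y[n] becomes `y[n - 1]`; the `calls` dictionary is an addition used only to count visits; and `functools.lru_cache` stands in for the explicit `arr` table of the memoized program.

```python
from functools import lru_cache
from math import comb


def lcs_naive(y, n, x, m, calls):
    """Plain recursion on prefixes y[:n], x[:m]; `calls` tallies each (n, m) visited."""
    calls[(n, m)] = calls.get((n, m), 0) + 1
    if n == 0 or m == 0:
        return 0
    if y[n - 1] == x[m - 1]:  # 0-based index for the paper's Y[n], X[m]
        return 1 + lcs_naive(y, n - 1, x, m - 1, calls)
    return max(lcs_naive(y, n - 1, x, m, calls), lcs_naive(y, n, x, m - 1, calls))


def lcs_memo(y, x):
    """Same recursion with a table look-up: each (n, m) pair is solved once."""

    @lru_cache(maxsize=None)
    def solve(n, m):
        if n == 0 or m == 0:
            return 0
        if y[n - 1] == x[m - 1]:
            return 1 + solve(n - 1, m - 1)
        return max(solve(n - 1, m), solve(n, m - 1))

    return solve(len(y), len(x))
```

With two strings of length 4 that share no characters, the tally for the pair (1, 1) in `lcs_naive` matches the binomial count quoted above, while `lcs_memo` touches each pair at most once.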

At this point it becomes clear how the recursive solution of (3) is derived (Bellman, 1959). Every Longest Common Subsequence of length k is composed of a prefix of length k − 1, and so on all the way down to k = 0. It is shown that all prefixes of the k-length LCS are also LCSs.

Theorem 3.1 (The k-separable inclusion)

Suppose that 𝑋 = {𝑥_𝑖 ∶ 𝑖 = 1, 2, . . . , 𝑛} is a finite set and let 𝐾 = {𝑥_{𝑖_𝑗} ∶ 𝑗 = 1, 2, . . . , 𝑘} be a subset of X. Consider the problem of selecting r (r ≤ n) genes from X in such a way that each selection contains all k genes of K and no two of these k genes are together. The number of such selections is given by

\[
P_{si}(n, r, k) = \frac{(n - k)!}{(n - r)!} \cdot \frac{(r - k + 1)!}{(r - 2k + 1)!}, \quad \forall\, k\ (k \ge 1) \text{ elements}, \quad 2k - 1 \le r \le n.
\]

Proof

Let K be a set of two special DNA fragments of interest; the LCS then arises when at least a subset of special interest occurs in the arrangement of the genes. The combinatorial arrangement is assumed, and a pair of strings is taken, string 𝑋 of length r and string 𝑌 of length r, to determine the longest common subsequence, that is, the longest sequence of characters that appear left to right (but not necessarily in a contiguous block) in both strings.

The result follows by solving for 𝐿𝐶𝑆[𝑖, 𝑗] in terms of the LCSs of the smaller problems. We consider the following two cases.

Case 1: what if 𝑋[𝑖] ≠ 𝑌[𝑗]? Then the desired subsequence has to ignore one of 𝑋[𝑖] or 𝑌[𝑗], so we have: 𝐿𝐶𝑆[𝑖, 𝑗] = 𝑚𝑎𝑥(𝐿𝐶𝑆[𝑖 − 1, 𝑗], 𝐿𝐶𝑆[𝑖, 𝑗 − 1]).

Case 2: what if 𝑋[𝑖] = 𝑌[𝑗]? Then the LCS of 𝑋[1. . 𝑖] and 𝑌[1. . 𝑗] might as well match them up. For instance, if a common subsequence matched 𝑋[𝑖] to an earlier location in 𝑌, you could always match it to 𝑌[𝑗] instead.

Therefore, in this case we have:

𝐿𝐶𝑆[𝑖, 𝑗] = 1 + 𝐿𝐶𝑆[𝑖 − 1, 𝑗 − 1].

Therefore, we can just do two loops (over values of 𝑖 and 𝑗) and, applying these rules, obtain the LCS.

[Figure 1: a table with 𝑥₁, 𝑥₂, 𝑥₃, ⋯, 𝑥_𝑟 along the top row and 𝑥₁, 𝑥₂, 𝑥₃, ⋮, 𝑥_𝑟 down the left-most column.]

Figure 1 shows what this looks like pictorially, with X along the left-most column and Y along the top row (see Moore, Laisin and Okoli, 2011, for the combinatorial proof of r-arrangements).
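The closed form of Theorem 3.1 can be compared with a direct count for small parameters. The sketch rests on an interpretation that is an assumption here: a "selection" is read as an ordered r-arrangement of distinct items, the k special items are taken to be items 0..k−1, and "not together" is read as "not adjacent in the arrangement". The names `p_si` and `count_si` are hypothetical.

```python
from itertools import permutations
from math import factorial


def p_si(n, r, k):
    """Closed form of Theorem 3.1; intended for k >= 1 and 2k - 1 <= r <= n."""
    return (factorial(n - k) // factorial(n - r)) * (
        factorial(r - k + 1) // factorial(r - 2 * k + 1)
    )


def count_si(n, r, k):
    """Brute force: r-arrangements of n items containing all k special items,
    with no two special items adjacent."""
    special = set(range(k))
    total = 0
    for arrangement in permutations(range(n), r):
        if not special.issubset(arrangement):
            continue
        flags = [item in special for item in arrangement]
        if not any(a and b for a, b in zip(flags, flags[1:])):
            total += 1
    return total
```

The first factor counts the ordered choices of the r − k non-special items and the second counts the ordered placements of the k special items into the r − k + 1 gaps between them, which is why the two functions agree under this reading.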

How to determine the sequence.

To find the sequence, we walk backwards through the matrix, starting at the lower-right corner. If either the cell directly above or the cell directly to the left contains a value equal to the value in the current cell, then move to that cell (if both do, choose either one). If both such cells have values strictly less than the value in the current cell, then move diagonally up-left (this corresponds to applying Case 2) and output the associated character. The characters of the LCS are output in reverse order. Considering Figure 2 below, the output on the matrix is 𝑥_𝑗𝑥_𝑒𝑥_𝑖𝑥_𝑑𝑥_𝑏𝑥_𝑓𝑥_𝑎.
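The backward walk can be sketched as follows, moving up or left while the value is unchanged and diagonally on a match. Assumptions made for illustration: each symbol such as x_a is written as the single letter `a`, the table carries an extra row and column of zeros for the empty prefixes, and the names `lcs_table` and `traceback` are hypothetical.

```python
def lcs_table(x, y):
    """Bottom-up table; table[i][j] is the LCS length of x[:i] and y[:j]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = 1 + table[i - 1][j - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def traceback(x, y, table):
    """Walk from the lower-right corner; characters come out in reverse order."""
    i, j = len(x), len(y)
    reversed_out = []
    while i > 0 and j > 0:
        if table[i - 1][j] == table[i][j]:
            i -= 1  # cell above holds the same value
        elif table[i][j - 1] == table[i][j]:
            j -= 1  # cell to the left holds the same value
        else:
            reversed_out.append(x[i - 1])  # diagonal move: a match (Case 2)
            i, j = i - 1, j - 1
    return "".join(reversed_out)
```

Reversing the returned string gives the LCS in left-to-right order; preferring "up" over "left" on ties is an arbitrary choice, as the text allows either.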

Theorem 3.2 (The k-separable non-inclusion)

Suppose that K is a set of two special DNA fragments of interest, so that the LCS arises when at least a subset of special interest occurs. Suppose further that 𝑋 = {𝑥_𝑖 ∶ 𝑖 = 1, 2, . . . , 𝑛} is a finite set and let 𝐾 = {𝑥_{𝑖_𝑗} ∶ 𝑗 = 1, 2, . . . , 𝑘} be a subset of X. Consider the problem of selecting r (r ≤ n) genes from X in such a way that each selection contains only some part of K, not all k genes, and no two of the selected k-genes are together. The number of such selections is given by

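\[
P_{sni}(n, r, k) =
\begin{cases}
\displaystyle\sum_{i=0}^{k-1} \frac{P_{(n-k,\,r-i)}\, P_{(r-i+1,\,i)}\, P_{(k,\,i)}}{i!}, & \text{if } r \ge k \text{ and } r + k \le n, \\[2ex]
\displaystyle\sum_{i=r+k-n}^{k-1} \frac{P_{(n-k,\,r-i)}\, P_{(r-i+1,\,i)}\, P_{(k,\,i)}}{i!}, & \text{if } r \ge k \text{ and } r + k > n, \\[2ex]
\displaystyle\sum_{i=0}^{r} \frac{P_{(n-k,\,r-i)}\, P_{(r-i+1,\,i)}\, P_{(k,\,i)}}{i!}, & \text{if } r < k \text{ and } r + k \le n, \\[2ex]
\displaystyle\sum_{i=r+k-n}^{r} \frac{P_{(n-k,\,r-i)}\, P_{(r-i+1,\,i)}\, P_{(k,\,i)}}{i!}, & \text{if } r < k \text{ and } r + k > n.
\end{cases}
\]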

This is the number of r-arrangements of n distinct elements (𝑟 ≤ 𝑛) with the non-inclusion of a fixed number k of elements (0 ≤ 𝑖 < 𝑘 ≤ 𝑛) such that they are always separate.
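The four cases of Theorem 3.2 differ only in the limits of the sum, so they can be evaluated by one loop and compared with a direct count. Assumptions made for this sketch: P_(a,b) is read as the number of b-permutations of a items (`math.perm`), a "selection" is read as an ordered r-arrangement, the special items are items 0..k−1, and "not together" means "not adjacent". The names `p_sni` and `count_sni` are hypothetical.

```python
from itertools import permutations
from math import comb, perm


def p_sni(n, r, k):
    """Sum of Theorem 3.2, with i running over the limits given by its four cases."""
    low = max(0, r + k - n)   # lower limit: 0 or r + k - n
    high = min(k - 1, r)      # upper limit: k - 1 or r
    return sum(
        perm(n - k, r - i) * perm(r - i + 1, i) * comb(k, i)  # P(k, i) / i! = C(k, i)
        for i in range(low, high + 1)
    )


def count_sni(n, r, k):
    """Brute force: r-arrangements of n items that contain fewer than k special
    items, with no two special items adjacent."""
    special = set(range(k))
    total = 0
    for arrangement in permutations(range(n), r):
        flags = [item in special for item in arrangement]
        if sum(flags) == k:
            continue  # contains all of K, so it is excluded
        if any(a and b for a, b in zip(flags, flags[1:])):
            continue  # two special items are adjacent
        total += 1
    return total
```

Each term of the sum chooses i of the k special items, arranges the r − i remaining items, and places the chosen special items into the r − i + 1 gaps.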

Proof

Theorem 3.2 follows from the proof of Theorem 3.1. By the arrangement in Theorem 3.2, it is clear that the LCSs can be selected by taking a pair of strings, string 𝑋 of length r and string 𝑌 of length r, and determining the longest common subsequence, that is, the longest sequence of characters that appear left to right (but not necessarily in a contiguous block) in both strings, such that it has the k-separable non-inclusion property.

4. APPLICATIONS

Example 4.1: Longest Common Subsequence

Suppose we are given two strings: string 𝑋 of length n and string 𝑌 of length m. What is the longest common subsequence, that is, the longest sequence of characters that appear left to right (but not necessarily in a contiguous block) in both strings?

Solution

For example, consider:

𝑋 = 𝑥_𝑎𝑥_𝑓𝑥_𝑏𝑥_𝑔𝑥_𝑐𝑥_ℎ𝑥_𝑑𝑥_𝑖𝑥_𝑒𝑥_𝑗

𝑌 = 𝑥_𝑔𝑥_𝑎𝑥_𝑓𝑥_𝑏𝑥_𝑑𝑥_𝑖𝑥_ℎ𝑥_𝑒𝑥_𝑗𝑥_𝑡

In this case, the LCS has length 7 and is the string 𝑥_𝑎𝑥_𝑓𝑥_𝑏𝑥_𝑑𝑥_𝑖𝑥_𝑒𝑥_𝑗. Another way to look at it is to determine a 1-1 matching between some of the letters in 𝑋 and some of the letters in 𝑌 such that none of the edges in the matching cross each other. For instance, this type of problem comes up all the time in genomics: given two DNA fragments, the LCS gives information about what they have in common and the best way to line them up. Let us now solve the LCS problem using Dynamic Programming.

As subproblems we will look at the LCS of a prefix of 𝑋 and a prefix of 𝑌, running over all pairs of prefixes. For simplicity, let’s worry first about determining the length of the LCS and then we can modify the algorithm to produce the actual sequence itself. Therefore, here is the question: say 𝐿𝐶𝑆[𝑖, 𝑗] is the length of the 𝐿𝐶𝑆 𝑜𝑓 𝑋[1. . 𝑖] 𝑤𝑖𝑡ℎ 𝑌[1. . 𝑗].

Now, to solve for 𝐿𝐶𝑆[𝑖, 𝑗] in terms of the LCS’s of the smaller problems we consider the following two cases;

Case 1: what if 𝑋[𝑖] ≠ 𝑌[𝑗]? Then the desired subsequence has to ignore one of 𝑋[𝑖] or 𝑌[𝑗], so we have: 𝐿𝐶𝑆[𝑖, 𝑗] = 𝑚𝑎𝑥(𝐿𝐶𝑆[𝑖 − 1, 𝑗], 𝐿𝐶𝑆[𝑖, 𝑗 − 1]).

Case 2: what if 𝑋[𝑖] = 𝑌[𝑗]? Then the LCS of 𝑋[1. . 𝑖] and 𝑌[1. . 𝑗] might as well match them up. For instance, if a common subsequence matched 𝑋[𝑖] to an earlier location in 𝑌, you could always match it to 𝑌[𝑗] instead. So, in this case we have:

𝐿𝐶𝑆[𝑖, 𝑗] = 1 + 𝐿𝐶𝑆[𝑖 − 1, 𝑗 − 1].

Therefore, we can just do two loops (over values of i and j), filling in the LCS using these rules. Here is what it looks like pictorially for the example above, with X along the left most column and Y along the top row.

       𝑥_𝑔  𝑥_𝑎  𝑥_𝑓  𝑥_𝑏  𝑥_𝑑  𝑥_𝑖  𝑥_ℎ  𝑥_𝑒  𝑥_𝑗  𝑥_𝑡
𝑥_𝑎     0    1    1    1    1    1    1    1    1    1
𝑥_𝑓     0    1    2    2    2    2    2    2    2    2
𝑥_𝑏     0    1    2    3    3    3    3    3    3    3
𝑥_𝑔     1    1    2    3    3    3    3    3    3    3
𝑥_𝑐     1    1    2    3    3    3    3    3    3    3
𝑥_ℎ     1    1    2    3    3    3    4    4    4    4
𝑥_𝑑     1    1    2    3    4    4    4    4    4    4
𝑥_𝑖     1    1    2    3    4    5    5    5    5    5
𝑥_𝑒     1    1    2    3    4    5    5    6    6    6
𝑥_𝑗     1    1    2    3    4    5    5    6    7    7

Figure 2: The standard table for LCS.

We fill out this matrix row by row, doing a constant amount of work per entry, so this takes 𝑂(𝑚 × 𝑛) time overall. The final answer (the length of the LCS of X and Y) is in the lower-right corner.
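The row-by-row filling can be sketched as below. For illustration, each symbol such as x_a is written as the single letter `a`, and the table is given an extra row 0 and column 0 for the empty prefixes; the names `lcs_length_table` and `format_table` are hypothetical.

```python
def lcs_length_table(x, y):
    """Fill the table row by row; entry [i][j] is the LCS length of x[:i] and y[:j]."""
    rows, cols = len(x), len(y)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]  # row 0 / column 0: empty prefix
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = 1 + table[i - 1][j - 1]                # Case 2: match
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])  # Case 1: no match
    return table


def format_table(x, y, table):
    """Render the table as text with x down the left and y along the top."""
    lines = ["    " + "  ".join(y)]
    for i, symbol in enumerate(x, start=1):
        lines.append(symbol + "   " + "  ".join(str(v) for v in table[i][1:]))
    return "\n".join(lines)
```

For the example strings, written as `"afbgchdiej"` and `"gafbdihejt"`, the lower-right entry of the table is the LCS length 7 stated in the text, and `print(format_table(x, y, table))` shows the rows in the same orientation as the figure.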

The traceback outputs 𝑥_𝑗𝑥_𝑒𝑥_𝑖𝑥_𝑑𝑥_𝑏𝑥_𝑓𝑥_𝑎, which read in reverse order gives 𝐿𝐶𝑆 = 𝑥_𝑎𝑥_𝑓𝑥_𝑏𝑥_𝑑𝑥_𝑖𝑥_𝑒𝑥_𝑗.

5. CONCLUSION

The dynamic programming solution to the Longest Common Subsequence problem does not depend on any generalizations. Using the same recurrence relation 2, we find a bottom-up solution to the problem. If we go back to the discussion of subproblems in LCSs, we see that according to our recursive definition, some subproblems overlap. Depending on conditions, certain subproblems can be eliminated. For example, when 𝑥_𝑖 = 𝑦_𝑗, only the subproblem of finding the LCS of 𝑋_{𝑖−1} and 𝑌_{𝑗−1} should be considered.

Restricting the search space in this way allows us to asymptotically speed up our algorithm. Shrinking the number of subproblems allows us to use dynamic programming constructs to solve the problem. The result is a formulation of LCS that is asymptotically just as fast as the memoization technique but, being iterative, significantly faster in practice. Intermediate data is stored in a matrix of dimensions (𝑚 × 𝑛), and there may be another matrix of the same size used to trace back the LCS itself after the main execution is over. Since this is only a constant factor, the space complexity is 𝑂(𝑚𝑛). Following this simple trend, it is reasonable to expect that improving the data structures and the precise expression of a problem will dramatically increase the speed of execution even though the time complexity remains unchanged. Since the LCS problem is NP-hard when the number of input sequences is arbitrary, it is, to put it lightly, rather difficult to find a sub-quadratic solution for LCSs in their current form.

Therefore, at this point, we might decide to generalize the LCS problem in search for better time complexity, or conversely, we might try to improve space complexity.
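One simple way to reduce space, sketched here as an illustration and not as the method of this paper, is to keep only two rows of the table when just the length of the LCS is needed. The name `lcs_length_two_rows` is hypothetical. Recovering the subsequence itself in linear space needs a more involved scheme such as the one of Hirschberg (1975).

```python
def lcs_length_two_rows(x, y):
    """LCS length keeping only the previous and current rows of the table."""
    if len(y) > len(x):
        x, y = y, x  # rows are as long as the shorter input
    previous = [0] * (len(y) + 1)
    for symbol in x:
        current = [0]  # column 0: empty prefix of y
        for j, other in enumerate(y, start=1):
            if symbol == other:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]
```

The running time is still proportional to m × n, but the working storage drops to two rows of length min(m, n) + 1.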

REFERENCES

Almeida, N. F. Jr., Alves, C. E. R., Caceres, E. N. and Song, S. W. (2003). Comparison of genomes using high-performance parallel computing. Proc. of the 15th Symposium on Computer Architecture and High-Performance Computing (SBAC-PAD 03), IEEE.

Batzoglou, S. and Pachter, L. (2007). Mitlcs/cgr comparative genomics. [Online; accessed 12-March-2018].

Bellman, R. E. (1959). Dynamic Programming. Princeton University Press, New Jersey.

Bergroth, L., Hakonen, H. and Raita, T. (2000). A survey of longest common subsequence algorithms. Proc. of 7th International Workshop String Processing and Information Retrieval, vol. 7, SPIRE, pp. 39–48.

Bonizzoni, P., Della Vedova, G., Dondi, R., Fertin, G. and Vialette, S. (2006). Exemplar longest common subsequence. Proc. of 2nd International Workshop on Bioinformatics Research and Applications (IWBRA), LNCS, 3992: 791–798.

Budalakoti, S., Srivastava, A. N., Akella, R. and Turkov, E. (2000). Anomaly detection in large sets of high dimensional symbol sequences. University of California, Santa Cruz.

Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C. (2001). Introduction to Algorithms, 2nd ed. The MIT Press.

Goeman, H. and Clausen, M. (1999). A new practical linear space algorithm for the longest common subsequence problem. Proceedings of the Prague Stringology Club Workshop 99 (Bonn, Germany), Universitat Bonn, pp. 1–21.

Hirschberg, D. S. (1975). A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6): 341–343.

Hirschberg, D. S. (1977). Algorithms for the longest common subsequence problem. Journal of ACM, 24(4): 664–675.

Hunt, J. W. and Szymanski, T. G. (1977). A fast algorithm for computing longest common subsequences. Commun. ACM, 20(5): 350–353.

Kuo, S. and Cross, G. R. (1989). An improved algorithm to find the length of the longest common subsequence of two strings. SIGIR Forum, no. 3-4, pp. 89–99.

Moore, C., Laisin, M. and Okoli, O. C. (2011). Generalized r-permutation and r-combination techniques for k-separable inclusion. International Journal of Applied Mathematics and Statistics, 23(D11): 46–53.

Moore, C., Laisin, M. and Okoli, O. C. (2011). Generalized r-permutation and r-combination techniques for k-separable non-inclusion. Global Journal of Pure and Applied Mathematics, 7(2): 129–139.

Rizvi, S. A. M. and Agarwal, P. (2006). A new bucket-based algorithm for finding LCS from two given molecular sequences. Proc. of the Third International Conference on Information Technology: New Generations (ITNG'06), IEEE.

Stephen, G. A. (1992). String search. School of Electronic Engineering Science, University College of North Wales.

Vlachos, M., Hadjieleftheriou, M., Gunopulos, D. and Keogh, E. (2003). Indexing multi-dimensional time-series with support for multiple distance measures. In Proceedings of SIGKDD03: ACM International Conference on Data Mining, August 24-27, Washington, DC, USA.

How to cite this article

Laisin, M. (2019). Dynamic Programming Application of Problem with Optimal Subsequence. Tropical Journal of Applied Natural Sciences, 2(3): 33-38.

Licensed under a Creative Commons Attribution 4.0 International License