
Computing the S* substring alphabet

In document Holt_unc_0153D_16498.pdf (Page 69-72)

CHAPTER 5: MULTI-STRING BWT CONSTRUCTION VIA INDUCED SORTING

5.3 Construction via induced sorting

5.3.1 Computing the S* substring alphabet

The algorithm begins by classifying each symbol of each string in the collection using the classification scheme of BWT-IS [Okanohara and Sadakane, 2009]. In any string T of length M, each symbol of T is either S-type or L-type. A symbol T[x] is S-type if T[x:M] < T[x+1:M], that is, if the suffix starting at index x lexicographically precedes the one starting at index (x+1). Otherwise, it is L-type. Additionally, in any run of S-type symbols, the left-most S-type symbol is defined as S*-type. If T[x] is S*-type, it is a local minimum in the string because T[x:M] < T[x+1:M] and T[x−1:M] > T[x:M]. By the alphabet definition, all initializer symbols must be S*-type since they will always be the absolute minimum in any given string, so the first symbol in every string is S*-type. Table 5.1 shows how each symbol is labeled for the example collection containing two strings. Note that finding all S*-type symbols can be done for any string of length M in O(M) steps [Okanohara and Sadakane, 2009]. To find the S*-type symbols in a string collection, the algorithm applies the same method to each string in turn until all S*-type symbols are known. For a string collection of combined string length N_i, the algorithm can therefore find the S* substrings in the entire collection in O(N_i) steps.

Figure 5.2: MSBWT-IS overview. This figure shows the outline for the induced sorting of MSBWT-IS. Starting at level i, the algorithm extracts all S* substrings from the string collection. Then, it sorts the S* substrings and assigns each a new symbol in a new alphabet. Then, each S* substring from level i is replaced with its corresponding symbol from the new alphabet to create the new string collection at level (i+1). If the calculation of the BWT of this new string collection is trivial, the algorithm will simply compute it. Otherwise, it will perform a recursive call on the new string collection. Once the BWT is obtained through either recursion or a trivial computation, all suffixes from the S* substrings are sorted against each other and used to assign an offset into the BWT at level i. Finally, these offsets and the BWT from level (i+1) are used to induce the solution to the BWT at level i.

String Index   c (integer[c])   Type   S* Substring (integer[S*])
0              $ (0)            S*     $TA (041)
1              T (4)            L
2              A (1)            S*     AGC (132)
3              G (3)            L
4              C (2)            S*     CT$ (240)
5              T (4)            L
0              $ (0)            S*     $GA (031)
1              G (3)            L
2              A (1)            S*     AGC (132)
3              G (3)            L
4              C (2)            S*     CG$ (230)
5              G (3)            L

Table 5.1: S* substring example - level 0. This table shows the process for identifying S* substrings in string collection σ_0 = {"$TAGCT", "$GAGCG"}. All symbols that are local minima (based on integer[c]) are classified as S*-type. The S* substrings are the strings from one S*-type symbol up to and including the next S*-type symbol. All initializer symbols ('$') are S*-type because an initializer symbol will always be an absolute minimum in the string.
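The classification rule above can be sketched with a single right-to-left scan. This is an illustrative sketch rather than the thesis implementation: the function name classify and the integer encoding are hypothetical, the input is assumed to be non-empty, and, as the wrap-around S* substrings in Table 5.1 suggest, each string is assumed to be read cyclically so that its last symbol is compared against the leading initializer.

```python
def classify(codes):
    """Label each symbol of one string as "S", "L", or "S*".

    `codes` holds the integer rank of every symbol, initializer first.
    Assumption: the string is read cyclically, so the last symbol is
    followed by the (minimal) initializer and is therefore L-type.
    """
    m = len(codes)
    types = ["L"] * m
    # Right-to-left scan: T[x] is S-type if it is smaller than T[x+1],
    # or equal to T[x+1] while T[x+1] is itself S-type.
    for x in range(m - 2, -1, -1):
        if codes[x] < codes[x + 1] or (
            codes[x] == codes[x + 1] and types[x + 1] == "S"
        ):
            types[x] = "S"
    # The left-most symbol of every run of S-type symbols becomes S*.
    for x in range(1, m):
        if types[x] == "S" and types[x - 1] == "L":
            types[x] = "S*"
    # The initializer at index 0 is always S*-type.
    types[0] = "S*"
    return types


# Integer ranks taken from the integer[c] column of Table 5.1.
RANK = {"$": 0, "A": 1, "C": 2, "G": 3, "T": 4}
print(classify([RANK[c] for c in "$TAGCT"]))
# ['S*', 'L', 'S*', 'L', 'S*', 'L']
```

Each symbol is visited a constant number of times, which is consistent with the O(M) bound cited above.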

Let an S* substring be defined as the substring from one S*-type symbol up to and including the next S*-type symbol [Okanohara and Sadakane, 2009]. Thus, the S* substrings for "$TAGCT" are "$TA", "AGC", and "CT$" (see Table 5.1). In our algorithm, whenever an S* substring is found, it is explicitly stored in a collection. For our implementation, we used a hash map as the collection, so all S* substrings can be identified and stored in the collection in O(N_i) steps.
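As a rough illustration of this step, the sketch below cuts each string at its S*-type positions and stores the resulting substrings in a hash-based container. The names s_star_substrings and collect are hypothetical, a Python set stands in for the hash map mentioned above, and the last S* substring of a string is assumed to wrap around to the leading initializer, as "CT$" does in Table 5.1.

```python
def s_star_substrings(codes):
    """Return the S* substrings of one string as tuples of integer ranks."""
    m = len(codes)
    is_s = [False] * m  # the last symbol is L-type under the cyclic reading
    for x in range(m - 2, -1, -1):
        is_s[x] = codes[x] < codes[x + 1] or (
            codes[x] == codes[x + 1] and is_s[x + 1]
        )
    # S* positions: index 0 plus every S-type symbol preceded by an L-type one.
    starts = [0] + [x for x in range(1, m) if is_s[x] and not is_s[x - 1]]
    # Each substring ends at the next S* position; the last one wraps to index 0.
    ends = starts[1:] + [m]
    cyclic = list(codes) + [codes[0]]
    return [tuple(cyclic[a:b + 1]) for a, b in zip(starts, ends)]


def collect(collection):
    """Store every distinct S* substring of the collection in a hash set."""
    seen = set()
    for codes in collection:
        seen.update(s_star_substrings(codes))
    return seen


# "$TAGCT" and "$GAGCG" encoded with $=0, A=1, C=2, G=3, T=4.
sigma_0 = [[0, 4, 1, 3, 2, 4], [0, 3, 1, 3, 2, 3]]
print(s_star_substrings(sigma_0[0]))  # [(0, 4, 1), (1, 3, 2), (2, 4, 0)]
print(len(collect(sigma_0)))          # 5
```

Insertion into a hash set takes expected constant time per lookup once the key is hashed, which is the property the O(N_i) claim relies on.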

Once all S* substrings are in the collection, the collection is sorted and each S* substring is assigned an integer symbol corresponding to its position in the sorted order. Thus, each S* substring has a one-to-one correspondence with a symbol in a new alphabet. For example, in Table 5.1, there are five distinct S* substrings with sorted order ["$GA", "$TA", "AGC", "CG$", "CT$"], which correspond to symbols [0, 1, 2, 3, 4] in Σ_1. It is also worth noting that at this point Σ_1 has five total symbols (k_1 = 5), two of which correspond to initializer symbols (t_1 = 2). This is trivially found by counting the number of S* substrings in the collection that begin with an initializer symbol. In this example, both "$GA" and "$TA" start with '$', so '0' and '1' must be initializer symbols in Σ_1.

String Index   c   Type   S* Substring
0              1   S*     1241
1              2   S
2              4   L
0              0   S*     0230
1              2   S
2              3   L

Table 5.2: S* substring example - level 1. This table shows the process for identifying S* substrings in σ_1 = {124, 023}. All symbols that are local minima are classified as S*-type. The S* substrings are the strings from one S*-type symbol up to and including the next S*-type symbol. In this case, each string is entirely contained within a single S* substring.
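The sorting and naming step can be sketched as follows. The function name build_alphabet and the parameter num_initializers are hypothetical. The sketch assumes that S* substrings are stored as tuples of integer ranks, that plain lexicographic comparison of those tuples gives the intended order, and that the initializer symbols of the current level occupy the smallest ranks 0 through num_initializers − 1.

```python
def build_alphabet(substrings, num_initializers=1):
    """Sort the distinct S* substrings and name each by its sorted position.

    Returns (symbol_of, k_next, t_next), where k_next is the size of the new
    alphabet and t_next is the number of its initializer symbols.
    """
    ordered = sorted(set(substrings))  # tuples compare lexicographically
    symbol_of = {sub: symbol for symbol, sub in enumerate(ordered)}
    k_next = len(ordered)
    # An S* substring that begins with an initializer symbol of this level
    # becomes an initializer symbol of the next level.
    t_next = sum(1 for sub in ordered if sub[0] < num_initializers)
    return symbol_of, k_next, t_next


# The S* substrings of Table 5.1 (integer[S*] column), duplicates included.
found = [(0, 4, 1), (1, 3, 2), (2, 4, 0), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
symbol_of, k_1, t_1 = build_alphabet(found)
print(symbol_of[(0, 4, 1)], k_1, t_1)  # 1 5 2
```

Because initializer symbols are the smallest symbols at every level, the substrings that begin with one sort first, so the new initializer symbols again occupy the smallest ranks of the new alphabet.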

Using the new alphabet Σ_{i+1} and the input string collection σ_i, the algorithm then creates a new string collection σ_{i+1}. For each string T in σ_i, it replaces each S* substring in T with its corresponding symbol in the S* substring collection. For example, in Table 5.1, the string "$TAGCT" is composed of three S* substrings: "$TA", "AGC", and "CT$". Using the sorted alphabet, these S* substrings correspond to 1, 2, and 4 from Σ_1, so string "124" is added to σ_1. Similarly, the second string "$GAGCG" is added as "023" to σ_1.
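The substitution itself amounts to one lookup per S* substring. The minimal sketch below takes the decomposition of each string and the symbol assignment directly from the worked example. The function name rewrite is hypothetical, and each string is assumed to be already split into its S* substrings.

```python
def rewrite(decomposed, symbol_of):
    """Replace every S* substring by its symbol in the new alphabet."""
    return [[symbol_of[sub] for sub in subs] for subs in decomposed]


# Symbol assignment for the sorted S* substrings of the example.
symbol_of = {"$GA": 0, "$TA": 1, "AGC": 2, "CG$": 3, "CT$": 4}
# Each string of the level-0 collection, split into its S* substrings (Table 5.1).
sigma_0 = [["$TA", "AGC", "CT$"], ["$GA", "AGC", "CG$"]]

sigma_1 = rewrite(sigma_0, symbol_of)
print(sigma_1)  # [[1, 2, 4], [0, 2, 3]]
```

With a hash map for symbol_of, each string is rewritten in time proportional to the total length of its S* substrings.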
