SufTra utilizes a suffix tree data structure to represent the search trajectories of a target algorithm. It extracts compact features from the search trajectories and uses them to calculate similarity with the cosine similarity technique. SufTra addresses CluPaTra's limitations as follows:
1. Scalability: We propose a linear-time algorithm for both suffix tree construction and traversal; and
2. Flexibility: We generate compact patterns from search trajectories and use them as features. The patterns may occur in multiple segments along the search trajectory, so suffix trees enable us to consider multiple-segment similarities to improve clustering accuracy.
In SufTra, we use the basic sequence representation of a search trajectory as described in subsection 3.2. Here, we explore only one sequence representation. SufTra works in four stages: sequence hashing, suffix tree construction, features retrieval, and instance-feature metric calculation. The details are as follows.
4.2.1 Sequence Hashing
In a search trajectory, several consecutive solutions may have similar solution properties before the final improvement that reaches the local optimum (for example, 04L-04L-04L-04L-04L-02P). We therefore compress the search trajectory sequence into a Hash String by removing consecutive repeated symbols, and we store the number of repetitions in a Hash Table for later use in the pairwise similarity calculations. The Hash String is the shorter version of the search trajectory obtained after compressing all repeated symbols. For example, the Hash String of 04L-04L-04L-04L-04L-02P is 04L-02P. A sequence with a longer repetition should receive a higher score because it contains more symbols. However, we cannot simply encode the number of repetitions in the Hash String, because the same symbol would then be represented differently whenever its repetition count differs, and we might lose some important features. To include the repetitions in the similarity score while preserving these features, we store the repetition counts in a Hash Table and consult them only when calculating the similarity score. In this example, the number of repetitions of 04L is 5.
Removing consecutive repeated symbols gives us two advantages:
1. It offers greater flexibility for SufTra in capturing more varieties of similarity for symbol patterns between two instances. Two instances may share similar patterns (such as 14L-5L) but have different numbers of consecutive symbols, e.g., 14L occurs 10 times in one instance and 5 times in another.
2. It reduces the computational cost of constructing and traversing the suffix tree, since the time needed depends on the length of the string. The Hash String is a more compact and shorter representation of the original search trajectory sequence.
After constructing the Hash Table and removing repetitions, we convert the symbol for each solution to a single character and concatenate the characters into a string (the Hash String).
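The hashing step can be sketched as follows. This is a minimal illustration rather than the implementation used in SufTra: the function names, the "-" separator, the choice to keep repetition counts as a list aligned with the Hash String positions, and the mapping of symbols to the characters A, B, C, ... are all assumptions made for the example.

```python
from itertools import groupby


def hash_sequence(trajectory, sep="-"):
    """Collapse runs of consecutive identical symbols.

    Returns (hash_symbols, run_lengths), where run_lengths[i] is the number
    of consecutive repetitions of hash_symbols[i] in the original sequence.
    The run_lengths list plays the role of the Hash Table (an assumption).
    """
    symbols = trajectory.split(sep)
    hash_symbols, run_lengths = [], []
    for symbol, run in groupby(symbols):
        hash_symbols.append(symbol)
        run_lengths.append(sum(1 for _ in run))
    return hash_symbols, run_lengths


def to_hash_string(hash_symbols, alphabet):
    """Map each multi-character symbol to one character and concatenate.

    `alphabet` is a dict shared across instances so that the same symbol
    always maps to the same character (hypothetical encoding: A, B, C, ...).
    """
    for symbol in hash_symbols:
        if symbol not in alphabet:
            alphabet[symbol] = chr(ord("A") + len(alphabet))
    return "".join(alphabet[symbol] for symbol in hash_symbols)


symbols, runs = hash_sequence("04L-04L-04L-04L-04L-02P")
print(symbols, runs)  # ['04L', '02P'] [5, 1]
```

Keeping the counts outside the string means that two instances with the same pattern but different repetition counts still produce the same Hash String.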
Figure 4.7: Example of a suffix tree (a) for a single string S1 = LMMNP and (b) for a set of two strings, S1 = LMMNP and S2 = LMNMM.
4.2.2 Suffix Tree Construction
The search trajectory sequences obtained in the previous section are used to build a suffix tree. A suffix tree is a data structure that exposes the internal structure of a string, allowing particularly fast implementations of many important string operations. Suffix trees are used to solve exact and inexact matching problems in linear time and are widely used in substring problems [46]. A suffix tree can be constructed in time linear in the length of the input string [46].
A suffix tree T for an m-character string S is a rooted directed tree having exactly m leaves numbered 1 to m. Each internal node, except for the root, has at least two children, and each edge is labeled with a substring (including the empty substring) of S. No two edges out of a node have edge labels beginning with the same character.
To represent the suffixes of a set {S1, S2, ..., Sn} of strings, we use a generalized suffix tree. A generalized suffix tree is built by appending a different end-of-string marker (a symbol not used in any part of the string, such as ∗) to each string in the set, then concatenating all the strings together and building a suffix tree for the concatenated string [46]. For example, for the strings LMMNP and LMNMM, the concatenated string is LMMNP∗LMNMM∗. The time needed to build this suffix tree is proportional to the total length of all the strings. An example of a suffix tree for a single string S1 and for a set of strings S1 and S2 is shown in Fig. 4.7.
In a suffix tree, we can easily retrieve the matching substrings of a set of strings by finding the branches that have leaves from the corresponding strings. In our suffix tree example (Fig. 4.7b), the branches with edge labels M, N, LM, MM, and MN are shared by S1 and S2. We use such common substrings to extract the SufTra instance features.
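The idea of finding branches that carry leaves from several strings can be illustrated with a small sketch. For brevity it builds a naive, uncompressed suffix trie in quadratic time instead of a linear-time suffix tree, and it records at every node which strings pass through it; the function names and the node layout are assumptions made for this example only.

```python
def build_suffix_trie(strings):
    """Naive generalized suffix trie (one node per character).

    Quadratic in the total string length, so it is only suitable for
    illustration. Each node stores the indices of the strings that
    contain the substring spelled by the path from the root.
    """
    root = {"children": {}, "owners": set()}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                if ch not in node["children"]:
                    node["children"][ch] = {"children": {}, "owners": set()}
                node = node["children"][ch]
                node["owners"].add(idx)
    return root


def shared_substrings(root, n_strings):
    """Return, sorted, every substring that occurs in all n_strings strings."""
    found = []
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            # Only descend while the path is still shared by every string.
            if len(child["owners"]) == n_strings:
                found.append(prefix + ch)
                stack.append((child, prefix + ch))
    return sorted(found)


trie = build_suffix_trie(["LMMNP", "LMNMM"])
print(shared_substrings(trie, 2))  # ['L', 'LM', 'M', 'MM', 'MN', 'N']
```

The output also contains L, because a trie has one node per character; in a compressed suffix tree, L is always followed by M in both strings and is therefore absorbed into the LM edge.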
We construct the suffix tree for the Hash Strings derived from the search trajectories using Ukkonen's algorithm [46]. We build a single generalized suffix tree by concatenating all the Hash Strings together to cover all training instances. The length of the concatenated string is proportional to the sum of all the Hash String lengths. Ukkonen's algorithm works by first building an implicit suffix tree containing the first character of the string and then adding successive characters until the tree is complete. Details of Ukkonen's algorithm can be found in [46]. Our implementation of Ukkonen's algorithm runs in O(n × l) time, where n is the number of instances and l is the maximum length of a Hash String.
4.2.3 Features Retrieval
After constructing the suffix tree, we extract the frequent substrings. As described in Definition 12, a substring is considered frequent if it has a sufficient length and occurs in a significant number of strings [50]. The minimum length and the minimum number of occurrences are determined by minsize and minsupport, respectively. We apply a local search to find sufficiently good values for these two parameters in reasonable time.
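The role of minsize and minsupport can be made concrete with a brute-force sketch. It enumerates substrings directly instead of traversing the suffix tree, so it does not reflect the linear-time retrieval described here; the function name is hypothetical, and support is assumed to mean the number of distinct Hash Strings that contain the substring.

```python
from collections import defaultdict


def frequent_substrings(hash_strings, minsize, minsupport):
    """Return substrings of length >= minsize found in >= minsupport strings.

    Brute-force enumeration for illustration only; a suffix tree would
    obtain the same information from its shared branches.
    """
    support = defaultdict(set)
    for idx, s in enumerate(hash_strings):
        for start in range(len(s)):
            for end in range(start + minsize, len(s) + 1):
                support[s[start:end]].add(idx)
    return sorted(sub for sub, owners in support.items()
                  if len(owners) >= minsupport)


print(frequent_substrings(["LMMNP", "LMNMM"], minsize=2, minsupport=2))
# ['LM', 'MM', 'MN']
```

Raising either threshold shrinks the feature set, which is why both values need to be tuned.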
We use a first-improvement local search that moves from the initial values of minsize and minsupport to their neighbors by changing either minsize or minsupport at each move, until the average distance among all instances in two different clusters no longer improves. To find the initial values of minsize and minsupport, we run a competition among five candidates, which are:
1. Lower bound of minsize and minsupport. We assume that a good feature pattern should appear in more than one instance and contain more than one symbol; therefore, we set the lower bound for both minsize and minsupport to 2.
2. Upper bound of minsize and minsupport. To set minsize and the minsupport upper minsupport values. If minsize is more than 20% of the maximum string length and minsupport is more than 20% of the number of instances, we would most likely not find any frequent substring. Therefore, we set the default upper bound of minsize to 20% of the maximum string length and the default upper bound of minsupport to 20% of the number of instances.
3. The middle value between the lower and upper bound.
4. First random value.
5. Second random value.
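The candidate competition and the first-improvement search can be sketched as below. Several details are assumptions made for this example: the objective is passed in as a hypothetical callable standing in for the average inter-cluster distance and is assumed to be maximized, the neighborhood changes one parameter by 1, the upper bounds are clamped so they never fall below the lower bound, and the winning candidate is the one with the best objective value.

```python
import random


def initial_candidates(max_len, n_instances, rng):
    """Five starting points: lower bound, upper bound, middle, two random."""
    lower = (2, 2)
    upper = (max(2, int(0.2 * max_len)), max(2, int(0.2 * n_instances)))
    middle = ((lower[0] + upper[0]) // 2, (lower[1] + upper[1]) // 2)
    randoms = [(rng.randint(lower[0], upper[0]),
                rng.randint(lower[1], upper[1])) for _ in range(2)]
    return [lower, upper, middle] + randoms


def first_improvement_search(start, lower, upper, objective):
    """Move to the first neighbor that improves the objective; stop when
    no neighbor does. A neighbor changes minsize or minsupport by 1."""
    current, best = start, objective(start)
    improved = True
    while improved:
        improved = False
        size, support = current
        neighbors = ((size - 1, support), (size + 1, support),
                     (size, support - 1), (size, support + 1))
        for cand in neighbors:
            in_bounds = (lower[0] <= cand[0] <= upper[0]
                         and lower[1] <= cand[1] <= upper[1])
            if in_bounds and objective(cand) > best:
                current, best, improved = cand, objective(cand), True
                break  # first improvement: accept immediately
    return current, best


def toy_objective(params):
    """Stand-in for the clustering quality; peaks at (4, 3)."""
    return -((params[0] - 4) ** 2 + (params[1] - 3) ** 2)


candidates = initial_candidates(50, 30, random.Random(0))
start = max(candidates, key=toy_objective)
print(first_improvement_search(start, (2, 2), (10, 6), toy_objective))
```

In practice the objective would run feature extraction and clustering for each parameter pair, so caching its results would avoid repeated work.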
4.2.4 Instance-Feature Metric Calculation
After extracting the features, we calculate the instance’s score for each feature and construct an instance-feature metric using the following rules:
1. if the instance does not contain the feature, the score is 0,
2. otherwise the score is calculated by summing up the number of repetitions for each symbol in the feature from the previously constructed Hash Table. A frequent substring may occur multiple times in one string. We calculate the score for each occurrence and choose the maximum score as the score for the instance-feature metric.
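The two scoring rules, together with the cosine similarity applied to the resulting score vectors, can be sketched as follows. The function names are hypothetical, and the repetition counts are assumed to be stored as a list aligned with the characters of the Hash String, which is one possible realization of the Hash Table.

```python
import math


def feature_score(hash_string, run_lengths, feature):
    """Score of one instance for one feature.

    Returns 0 if the feature does not occur. Otherwise, for every
    occurrence, sums the repetition counts of the covered symbols and
    returns the maximum over all occurrences.
    """
    best = 0
    start = hash_string.find(feature)
    while start != -1:
        occurrence = sum(run_lengths[start:start + len(feature)])
        best = max(best, occurrence)
        start = hash_string.find(feature, start + 1)
    return best


def cosine_similarity(u, v):
    """Cosine similarity of two score vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# "AB" occurs twice in "ABAB": scores 5 + 1 = 6 and 2 + 3 = 5, so the maximum is 6.
print(feature_score("ABAB", [5, 1, 2, 3], "AB"))  # 6
```

Each instance is then represented by its vector of feature scores, and pairs of instances are compared with cosine_similarity.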