2.7 DFA Inference Competitions
3.1.11 DFASAT Algorithm
An earlier study by Heule and Verwer [118] proposed translating the DFA inference problem into a satisfiability (SAT) problem. Their translation is inspired by a previous reduction of the DFA identification problem to graph colouring [120].
Graph colouring is the problem of assigning colours to the nodes of a given graph so that any two nodes connected by an edge receive different colours; in this setting it is sometimes known as state colouring. For DFA identification, the graph is coloured so that compatible states in the same block receive the same colour, while states that cannot be merged receive different colours [120].
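The colouring view can be illustrated with a minimal Python sketch. The conflict graph, the colour names, and the helper functions `is_valid_colouring` and `blocks` are hypothetical and are not taken from [118] or [120]; the sketch only assumes that an edge joins two states that cannot be merged.

```python
def is_valid_colouring(conflicts, colour):
    """Return True if no two states joined by a conflict edge share a colour."""
    return all(colour[u] != colour[v] for u, v in conflicts)


def blocks(colour):
    """Group states by colour; each group would be merged into one DFA state."""
    grouped = {}
    for state, c in colour.items():
        grouped.setdefault(c, []).append(state)
    return sorted(sorted(group) for group in grouped.values())


# Hypothetical conflict graph: an edge joins two states that cannot be merged.
conflicts = [(0, 1), (1, 2), (2, 3)]
# Hypothetical colouring: states 0 and 2 share a block, as do states 1 and 3.
colour = {0: "red", 1: "blue", 2: "red", 3: "blue"}

valid = is_valid_colouring(conflicts, colour)
merged_blocks = blocks(colour)
```

Here two colours suffice, so the four states collapse into two blocks. A colouring that gave states 0 and 1 the same colour would be rejected, because those two states are joined by a conflict edge.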
Heule and Verwer [118] focused on translating this graph colouring formulation into SAT. However, the translation can produce a huge number of clauses, too many for existing SAT solvers to handle. This explains why the DFASAT algorithm runs EDSM in the early steps before calling the SAT solver to complete the inference process: the preliminary merges avoid handling such a large number of clauses.
Heule and Verwer [110] developed the DFASAT algorithm, which attempts to find multiple DFA solutions for each inference task. The number of solutions is set by the user through the parameter n. Heule and Verwer [110] stated that the early solutions obtained by DFASAT can reach 99% accuracy if the training data is not sparse. When the data is very sparse, however, multiple solutions can be combined to classify the test set, as in the StaMinA competition.
In general, DFASAT begins by running EDSM for the early steps in order to reduce the DFA inference problem to a size that the SAT solver can handle. The state machine resulting from this stage is called a partial DFA. The SAT solver is incorporated to take over at the point where the EDSM learner becomes weak at finding good DFA solutions [110].
It is important to identify when to stop the EDSM learner and start the SAT solver. Heule and Verwer [110] introduced the parameter m to determine when to stop traditional EDSM state merging and begin SAT solving. The merging procedure stops when the number of states reachable by the positive examples in the training sample is less than m. The parameter m was set to 1000 in the StaMinA competition.
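The stopping test can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `build_apta` and `states_reached_by_positives`, the toy samples, and the choice to count the root state among the reached states are all hypothetical and are not prescribed by [110].

```python
def build_apta(positive, negative):
    """Build a prefix tree acceptor; state 0 is the root."""
    transitions = {}  # (state, symbol) -> state
    labels = {}       # state -> True (accepting) or False (rejecting)
    next_state = 1
    labelled = [(s, True) for s in positive] + [(s, False) for s in negative]
    for sample, label in labelled:
        state = 0
        for symbol in sample:
            if (state, symbol) not in transitions:
                transitions[(state, symbol)] = next_state
                next_state += 1
            state = transitions[(state, symbol)]
        labels[state] = label
    return transitions, labels


def states_reached_by_positives(transitions, positive):
    """Collect every state visited while reading the positive samples."""
    reached = {0}  # assumption: the root counts as reached
    for sample in positive:
        state = 0
        for symbol in sample:
            state = transitions[(state, symbol)]
            reached.add(state)
    return reached


# Hypothetical toy sample over the alphabet {a, b}.
positive = ["ab", "abb"]
negative = ["b", "aa"]
transitions, labels = build_apta(positive, negative)

m = 5  # hypothetical merge bound (the text reports m = 1000 for StaMinA)
reached = states_reached_by_positives(transitions, positive)
switch_to_sat = len(reached) < m
```

In this toy tree the positive samples visit four of the six states, so with m = 5 the learner would stop merging and hand the problem to the SAT solver.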
The DFASAT algorithm is illustrated in Algorithm 6. The DFASAT learner begins by initializing a parameter t to infinity; this parameter is used later to indicate the target number of states for the inferred DFA. The benefit of t is that if the number of red states in the current hypothesis DFA is larger than t, the merges performed are assumed to be inefficient [110]. Because t is initially infinite, the first round of greedy merges always proceeds to the SAT solver; thereafter the solver is called only when |R| ≤ t, and t is then reduced to the number of red states |R| [110]. After initializing t, DFASAT invokes GenerateAPTA(S+, S−) to generate the initial APTA from the provided samples. States are then selected and merged using the EDSM algorithm for several steps, as shown in lines 7-11.
The parameter m bounds the number of merges performed using EDSM before the SAT solver starts. Once the number of states in A′ that are reached by the positive examples is smaller than m, SAT solving begins in order to find the smallest DFA [110]. Otherwise, DFASAT continues learning LTSs using EDSM. The parameter t is used later to refer to the target size of the DFA [110]. Once the APTA becomes small, it is translated into a large set of clauses, which are passed to the solver to find a DFA, as shown in lines 19-30. Every time a DFA is inferred, it is added to the set D, as shown in line 24. The reason for collecting multiple solutions is to obtain the best generalization by combining them using the ensemble method [121].
Algorithm 6: The DFASAT algorithm
Require: an input sample of sequences S = S+ ∪ S−, a test sample St, merge bound m, number of DFA solutions n, accepting vote percentage avp between 0 and 1
Ensure: Label is a labelling for St aimed to give high accuracy for software models
1  Let t ← ∞
2  Let D ← ∅ // D is a set of multiple DFA solutions
3  A = GenerateAPTA(S+, S−) // generate the APTA A from the sequences
4  while |D| < n do
5      // while the number of DFA solutions is less than n
6      Let A′ ← copyAPTA(A) // create another copy A′ of the APTA A
7      while |A′|p > m do
8          // while the positive sequences reach more than m states in A′
9          select q and q′ in A′ using random greedy
10         A′ = merge(A′, q, q′) // merge states in A′ using random greedy
11     end
12     // if A′ has more than t red states
13     if |R| > t (R being the red states in A′) then
14         // find a better partial DFA solution
           continue with the next while loop iteration
15     end
16     set t ← |R| // else update t to the number of resulting red states
17     let i ← 0 // initialize the number of additional states to 0
18     // while no solution has been found for the remaining problem
19     while true do
20         translate A′ to a SAT formula using |R| + i colours // try to find an exact solution with i extra states
21         solve the formula using a SAT solver
22         if the solver returns a DFA solution A″ then
23             // if the SAT solver finds a solution, add it to D
24             add A″ to D and break
25         else if the solver used more than 300 seconds then
26             break // try another partial solution if the problem is too hard
27         else
28             set i ← i + 1 // else try to find a larger solution
29         end
30     end
31 end
32 let Label be an empty labelling // initialize the test labelling
33 // iterate through the test set St
34 forall s ∈ St do
35     if |{A ∈ D | s ∈ L(A)}| / |D| ≥ avp then
36         append ‘1’ to Label // s is labelled as positive because at least the fraction avp of the solutions accept s
37     else
38         append ‘0’ to Label // label s as negative
39     end
40 end
41 return Label
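The inner loop of lines 19-30 in Algorithm 6, which retries with one extra colour whenever no solution exists, can be sketched in Python. This is a deliberately simplified illustration, not the encoding of [110]: a small backtracking search stands in for the SAT solver, only conflict edges are modelled (a real DFA encoding must also constrain transitions), the 300-second timeout is omitted, and the names `colour_with_k` and `smallest_colouring` are hypothetical.

```python
def colour_with_k(n_states, conflicts, k):
    """Backtracking search for a k-colouring; a stand-in for the SAT call."""
    neighbours = {s: set() for s in range(n_states)}
    for u, v in conflicts:
        neighbours[u].add(v)
        neighbours[v].add(u)
    colour = {}

    def assign(state):
        if state == n_states:
            return True  # every state has a colour
        for c in range(k):
            # A colour is allowed if no already-coloured neighbour uses it.
            if all(colour.get(nb) != c for nb in neighbours[state]):
                colour[state] = c
                if assign(state + 1):
                    return True
                del colour[state]
        return False

    return dict(colour) if assign(0) else None


def smallest_colouring(n_states, conflicts, start_k):
    """Try start_k, start_k + 1, ... colours until a colouring exists."""
    k = start_k
    while True:
        solution = colour_with_k(n_states, conflicts, k)
        if solution is not None:
            return k, solution
        k += 1  # corresponds to "set i <- i + 1" in Algorithm 6


# Hypothetical conflict graph: a triangle plus one state conflicting with state 0.
conflicts = [(0, 1), (1, 2), (0, 2), (0, 3)]
k, solution = smallest_colouring(4, conflicts, start_k=2)
```

Starting from two colours, the search fails on the triangle and succeeds once a third colour is allowed, which mirrors how the algorithm grows the number of colours from |R| upwards.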
The DFASAT algorithm attempts to generate many DFA solutions. Once a number of DFA solutions have been generated, each test sequence is passed to every DFA to determine whether it is accepted or rejected. Heule and Verwer [110] introduced the accepting vote percentage (avp): if a test sequence is accepted by at least the fraction avp of the generated DFAs, it is classified as positive; otherwise, it is classified as negative. This idea is motivated by the ensemble method [121]; it aims to improve classification accuracy and to address the problem of data sparseness in the StaMinA competition.
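The voting step can be sketched in Python as follows. The DFA representation (a tuple of transition table, accepting set, and start state), the three toy DFAs, and the helper names `accepts` and `label_test_set` are assumptions made for illustration; treating a missing transition as a rejection is likewise an assumption rather than something stated in [110].

```python
def accepts(dfa, sequence):
    """Run a DFA on a sequence; a missing transition is treated as rejection."""
    transitions, accepting, start = dfa
    state = start
    for symbol in sequence:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


def label_test_set(dfas, test_set, avp):
    """Label a sequence '1' if at least the fraction avp of the DFAs accept it."""
    labels = []
    for s in test_set:
        votes = sum(accepts(d, s) for d in dfas)
        labels.append("1" if votes / len(dfas) >= avp else "0")
    return "".join(labels)


# Three hypothetical DFA solutions over the alphabet {a, b}.
ends_with_a = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}, {1}, 0)
contains_a = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}, {1}, 0)
accept_all = ({(0, "a"): 0, (0, "b"): 0}, {0}, 0)
dfas = [ends_with_a, contains_a, accept_all]

test_set = ["a", "ab", "b"]
majority = label_test_set(dfas, test_set, avp=0.5)
strict = label_test_set(dfas, test_set, avp=0.7)
```

With avp = 0.5 the sequence "ab" is labelled positive because two of the three DFAs accept it; raising avp to 0.7 turns that label negative, which shows how the threshold trades acceptance against agreement among the solutions.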