241 | P a g e
HYBRID METHODS BASED ON GENETIC ALGORITHM FOR EFFICIENT TEST DATA
GENERATION
Anusha M Shetty
1, Shalini P.R.
21,2Dept. of Computer Science and Engineering, NMAM Institute of Technology, (India)
ABSTRACT
Test data generation is the process of creating a set of data for testing the acceptance of software applications.
Most of the modern test case generators use Genetic Algorithm for test data generation. This approach has many drawbacks. This paper describes existing Genetic algorithm and its limitations. This paper also gives a survey on hybrid techniques that are used to overcome limitations of genetic algorithm.
Keywords – Genetic Algorithm, Swarm Intelligence, Symbolic Execution, Tabu Search, Test data generation
I. INTRODUCTION
Software testing is used for improving the quality of software. It is a labor intensive and time consuming task. It consumes 50 percent of cost of software development. Introducing automated test generators speeds up the process of testing and increases quality of software being developed. An important factor in automation of software testing is automated test data generation. Many techniques have been proposed over the years for test data generation. Most commonly used techniques for test data generation are random test data generation, symbolic test data generation, dynamic test data generation and test data generation based on genetic algorithms. Out of all these techniques test data generation using genetic algorithm is most widely used approach. Along with test data generation we must also have a test adequacy criterion, which is a measure of completeness of a testing process. Commonly used test adequacy criteria are statement coverage, branch coverage and path coverage. The automatic generation of test data is mostly based on GA. This is because GA is best suited for test data generation compared to other available techniques, but they have some limitations.
These problems can be solved using hybrid methods. This paper gives an overview of genetic algorithm and limitations of using genetic algorithm. It also gives a survey on test data generation using genetic algorithm based hybrid methods and shows how these hybrid techniques overcome limitations of genetic algorithm.
This paper is organized as follows: Section 2 defines genetic algorithm, describes use of genetic algorithm for test data generation and lists limitations of genetic algorithm. Section 3 is a review of hybrid methods based on genetic algorithm used to generate efficient test data. Section 4 presents comparison between surveyed hybrid
242 | P a g e
test data generation techniques based on genetic algorithm. Section 5 describes how hybrid test data generation techniques are used to overcome limitations of genetic algorithms. Section 6 gives conclusion and future work.
II. GENETIC ALGORITHM
Genetic algorithms (GAs) are search algorithms based on principles of natural selection. Genetic algorithms are used for solving optimization problems. Genetic algorithms are inspired by Darwin’s theory of evolution.
Genetic Algorithm starts with a set of solutions called population. The evolution usually starts from a population of randomly generated individuals. Initially, the fitness of every individual in the population is evaluated.
Solutions from one population are taken based on their fitness value and used to form a new population. The new population is always better than the old one. Algorithm is repeated until either a maximum number of generations have been produced, or a satisfactory fitness level has been reached for the population. Fig. 1 illustrates basic steps in a simple genetic algorithm.
Using genetic algorithms in test data generation for software testing is the process of identifying a set of program input data, which satisfies a given testing criterion.
Figure 1: Simple genetic algorithm
In translating the concepts of genetic algorithms to the problem of test-data generation we perform the following tasks [1]:
1. First of all we consider the population to be a set of test data.
Initialize population
Calculate fitness value
Select pair of solution based on fitness
Crossover parents to produce offspring
Mutate new offspring
Place offspring in new population
Criteria met?
Stop
population
No
Yes
243 | P a g e
2. Find the set of test data that represents the initial population. This set is randomly generated according to the format and type of data used by the program under test.
3. Determining the fitness of each individual which is based on a fitness function that is problem- dependent.
4. Select two individuals that will be mated to contribute to the next generation.
5. Apply the crossover and mutation processes.
6. The above algorithm will iterate until the population has evolved to form a solution to the problem (satisfies a given testing criterion), or until a termination condition is satisfied.
Genetic Algorithms are one of the most powerful methods and they give high quality solution to a problem.
Even though they are the first choice for test data generation, they still have some limitations. The limitations of GA includes following:
1. Premature Convergence.
2. Local search capability of GA is not strong.
3. Doesn’t find exact global optima.
4. Problem of writing fitness function – fitness function is not precise.
5. Simple type of genetic operators are used which may destroy input’s data type.
Many researches have done so far to overcome these problems. But there is no one Technique or method found so far that is capable of solving all these problems.
III. THE RELATED WORK
Test data generation is a difficult part in automated software testing. Though there are many different techniques proposed for this purpose, test data generation based on genetic algorithms is most commonly used technique now a days.
Many approaches based on genetic algorithms are used for test data generation. Praveen Ranjan Srivastava in [2] has presented application of genetic algorithm for test data generation. In this work, GAs is used to generate test data for structures which is not achieved in random search approach. Critical paths are examined first and this approach is effective in generating test data. Most of the approaches based on GA have limitations. Sultan H in [1] listed some of the limitations of using GA for test data generation. A comparison of existing genetic-based test data generation techniques is presented in the paper. This comparison is done on the basis of coverage criterion, fitness function, chromosome representation, the base of initial population selection, type and the rate of cross over, mutation operators and population size. Sheng Zhang in [3] shows combination of GA and PSO (Particle Swarm Optimization), which is called GA-PAO algorithm. PSO has strong local search capability and fast convergence characteristics, which can efficiently improve deficiencies of GA. GA-PAO takes advantage of two algorithms by using the breadth of genetic algorithm searching and using particle swarm to optimize the fine particles locally, which find the optimal solution quickly to achieve good optimization results. Application of GA-PAO for test data generation improves number of iterations used and running time. Minjie Yi in [4] has introduced ACSGA (Ant Colony System Genetic Algorithm). ACSGA was created by introducing GA into ACS algorithm to improve efficiency and accuracy of optimization time. In this mixed algorithm, Solution group found by ants and global optimal solution calculated by ACS is set as initial population for GA. Then output of
244 | P a g e
GA (updated population) is used as updated global optimal solution for ACS. Xiajiong Shen and Qian Wang in [5] have given a hybrid method called GATS, which is combination of GA and Tabu Search. GATS combine the local search and climbing ability of tabu algorithms with the parallel search and global search ability of GA to resolve the problem of generating test cases. In this method, a tabu mutation operator is used in the GA. This approach generates efficient test cases and its optimizing performance is superior to simple GA. M.Parthiban in [6] presents a new approach for test data generation – GASE. GASE is a combination of GA with Symbolic Execution and this method achieves higher branch coverage compared to other existing methods. Yong Chen in [7] has presented a concept called BDBFF (Fitness Function based on Branch distance). BDBFF is applied to GA-based path oriented test data generation. Four combinations of selection and crossover operations are compared. With increasing length of chromosomes, combination of Stochastic Universal sampling and Two- point Crossover always outperforms other combinations. Mingjie Deng in [8] has presented a new approach of combing GA with dataflow analysis. This approach achieves sufficient code coverage and it takes less time to generate test data as compared to GA. Ciyong Chen in [9] has shown a new approach called EPDG-GA. By considering correlation between branches, EPDG (Edge Partitions Dominator Graph) is constructed. Then small subset of braches called critical branches (CBs) is identified. Later a metrics is used to prioritise CBs. Finally GA is used to generate test data for covering these branches. EPDG-GA achieves higher branch coverage with less average generations.
IV. COMPARISON OF HYBRID METHOD BASED ON GENETIC ALGORITHM
This section presents a comparison among the genetic algorithms based hybrid test data generation techniques considering dimensions such as coverage criterion, fitness function and so on.
As shown in Table 1, Sheng Zhang [3], Minjie Yi [4] and Young Chen [7] use path as coverage criterion for test data generation. Whereas in the work of M.Parthiban [6] and Ciyong Chen [9], methods used to find input data are based on branch predicates.
Table 1: Comparison based on coverage criterion and fitness function.
Coverage Criterion
Fitness Function
Sheng Zhang Path Branch function
superposition
Minjie Yi Path ffitness(x) = ∑ fi
(x1,x2,...,xm)
M.Parthiban Branch Percentage of
coverage achieved
Yong Chen Path Branch distance
Ciyong Chen
Branch Predicate function
For the second dimension, Fitness function, Sheng Zhang’s [3] fitness function uses a special “Branch function Superposition method”. Also, they quantify each branch predicate with branch function (f). Then, objective function is obtained by superposition of branch function, which is: F = max- (F1 + F2 + ... + Fm). The objective
245 | P a g e
function F is used to assess the strength and weakness of test data. In their experiment 4 paths are chosen as test goal.
Table 2: Branch Fitness Function.
Predicate Fitness Function Φ
a=b if abs(a-b)=0 then 0 else abs(a-b)
a!=b if abs(a-b) ≠0 then 0 else K
a<b if (a-b<0) then 0 else (a-b)+K
a<=b if(a-b≤0) then 0 else (a-b)+K
a>b if (b-a<0) then 0 else (b-a)+K
a>=b if (b-a≤0) then 0 else (b-a)+K
a && b max(Φ(a),Φ(b))
a || b min(Φ(a), Φ(b))
Boolean if TRUE then 0 else K
Minjie Yi’s [4] fitness function is given by: ffitness(x) = ∑ fi (x1,x2,...,xm). This fitness function yields better test data. They carried out experiment on 4 paths. M.Parthiban’s [6] fitness function is the coverage of branches in a program, i.e., the number of covered branches to the total number of branches. Yong Chen’s [7] fitness function is sum of all related branch distance functions. Ciyong Chen’s [9] fitness function is defined by analysing Pre- Dominator Tree (PreDT) and by identifying critical branches (CBs) & prioritizing CBs. A path containing CB is picked out and branch predicates in this path are instrumented with fitness function. They used fitness function in table 2.
Table 3 shows comparison between some of the surveyed hybrid techniques according to genetic algorithm parameters. All these techniques use binary string for chromosome representation. All techniques select initial population randomly. Most of these techniques use single point crossover, Minjie Yi’s [4] technique uses multipoint uniform crossover and Yong Chen’s [7] technique uses two-point crossover. Different crossover rates are used in each of these techniques. Ciyong Chen [9] uses 0.4, Xiajiong Shen [5] and Minjie Yi [4] uses 0.8, Sheng Zhang [3] uses 0.85 and Yong Chen [7] uses 0.9 as crossover rate. Also, all techniques use simple mutation except Xiajiong Shen’s [5] technique use tabu mutation operator. Each technique has different population size. In most of the techniques, approaches used to select the survival individuals are high fitness and in Minjie Yi’s [4] work, ∑ fi (x1,x2,...,xm) is used. All of the surveyed techniques are efficient at generating expected test data based on the scenario (Problem).
246 | P a g e Table 3: Comparison based on genetic algorithm parameters.
Chromo some Represe ntation
Initial Population Selection
Crossover Operator
Mutation Operator
Popula tion Size
Selection
Type Rate Type Rate
Ciyong Chen
Binary String
Randomly Single Point
0.4 Simple Mutation
0.1 100 High fitness Xiajiong
Shen
Binary String
Randomly Single Point
0.8 Tabu Muatation
0.09 50 High fitness Sheng
Zhang
Binary String
Randomly Single Point
0.85 Simple Mutation
0.05 100 High fitness Minjie Yi Binary
String
Randomly Multi point Unifor m
0.8 Simple Mutation
0.1 Path ∑fi
(x1,x2,...,xm )
Yong Chen Binary String
Randomly Two- Point
0.9 Simple mutation
0.01 40 High Fitness
V. BENEFITS GENETIC ALGORITHM BASED HYBRID METHOD
Hybrid methods based on genetic algorithms are nothing but combining genetic algorithm with another technique or method so as to improve the efficiency of test data generation. Table 4 shows surveyed hybrid methods and advantages of using them.
Section 2 gives a list of limitations using genetic algorithm. All the surveyed hybrid techniques have the capability to overcome one or more of these limitations. GAs suffers from premature convergence. ACSGA solves this problem by introducing GA into Ant Colony System algorithm. ACS has the ability to inhibit premature convergence. GA is not strong in local search. There are two surveyed techniques that help in improving local search efficiency. One of them is GA-PAO (combination of GA and PSO). Another one is GATS (combination of GA and TS) PSO and TS has strong local search capability, so introducing them in GA will improve local search ability. GA fails to find exact global optima. GATS have the ability to find exact global optima. Tabu search always converges easily and GA has ability to find near global optima. Thus, GATS algorithm converges at exact global optima. Fitness function plays a key role in GA. So we need to select a fitness function in such way that generated input data should be able to satisfy test requirements. Yong Chen applied BDBFF (fitness function based on branch distance) to one of GA based technique. In his approach he has taken a solid fitness function. This we can see from the results. Most of the hybrid techniques chosen define a solid fitness function. GA-PSO and ACS-GA also have well defined fitness functions. So these techniques help in generating efficient data. In GA, simple types of genetic operators are used. Xiajiong Shen has shown use of “Tabu mutation operator” instead of simple mutation operator. This helped in generating efficient test data. Also, Yong Chen has used “two-point cross” and Minjie Yi has used “multipoint crossover”. Genetic operators used in these techniques are best suited for the situation.
247 | P a g e Table 4: Benefits of Hybrid techniques
Technique Used Advantages
Sheng Zhang
GA + PSO Reduces running time and number of iterations.
GA helps in global search and PSO helps in local search. So overall efficiency is increased.
Minjie Yi ACS + GA Inhibits Premature Convergence Xiajiong
Shen
GA + TS Achieves exact global optima and converges.
Improves local search.
M.Parthiba n
GA + SE High branch coverage.
Yong Chen GA with BDBFF Consumes fewer system resources and improves probability of finding suitable solution.
Mingjie Deng
GA with Data flow analysis Fast test data generation and sufficient code coverage.
Ciyong Chen
GA with EPDG High branch coverage with less average generations.
All the surveyed techniques are efficient for test data generation when compared to genetic algorithm. Surveyed techniques also help in reducing running time of the algorithm and they take less number of iteration for test data generation. So, hybrid techniques are more useful and appropriate for test data generation.
VI. CONCLUSION AND FUTURE WORK
The GA based Hybrid methods make the test-data generation process easy. This paper gives advantages of using “Hybrid Methods” for test-data generation. The surveyed methods solve some of the problems of GAs.
But there is no one solution or a technique that is capable of solving all the problems of GAs. Future work is intended to find a solution that can overcome major limitations of GA based techniques. More researches are required to improve quality of test data generation.
REFERENCES
[1] Sultan H. Aljahdali, Ahmed. S. Ghiduk, Mohammed El-Telbany. The limitation of genetic algorithms in software testing. IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), 1-7, May 2010.
[2] Praveen Ranjan Srivastava, Tai-hoon Kim. Application of Genetic Algorithm in Software Testing.
International Journal of Software Engineering and Its Applications Vol. 3, No.4, October 2009.
[3] Sheng Zhang, Ying Zhang, Hong Zhou, Qingquan He.Automatic path test data generation based on GA- PSO. IEEE International Conference on Intelligent computing and Intelligent Systems (ICIS) Vol.1, 142- 146, October 2010.
[4] Minjie Yi. The Research of path-oriented test data generation based on a mixed ant colony system algorithm andgenetic algorithm. 8thInternational Conference on Wireless Communications, Networking and Mobile Computing(WiCOM), 1-4, September 2012.
[5] Xiajiong Shen and Qian Wang, Peipei Wang and Bo Zhou, Automatic Generation of Test Case based on GATS Algorithm.IEEE Conference on Granular Computing, 496-500, 2009.
248 | P a g e
[6] M.Parthiban, M.R.Sumalatha. GASE - An Input Domain Reduction and Branch Coverage System Based on Genetic Algorithm and Symbolic Execution. International Conference on Information Communication and Embedded Systems (ICICES), 429-433, February 2013.
[7] Yong Chen, Yong Zhong. Experimental Study on GA-based Path-Oriented Test Data Generation Using Branch Distance. Third International Symposium on Intelligent Information Technology Application, Vol. 1, 216-219, November 2009.
[8] Mingjie Deng, Rong Chen, Zhenjun Du. Automatic Test Data Generation Model by Combining Dataflow Analysis with Genetic Algorithm. Joint Conference on Pervasive Computing (JCPC), 429-434, December 2009.
[9] Ciyong Chen, Xiaofeng Xu, Yan Chen, Xiaochao Li and DonghuiGuo. A New Method of Test Data Generation for Branch Coverage in Software Testing Based on EPDG and Genetic Algorithm. 3rd International Conference on Anti-counterfeiting, Security and Identification in Communication, 307-310, August 2009.