Vol 8, No 8 (2018)

(1)

Research Article

August

2018

Computer Science and Software Engineering

ISSN: 2277-128X (Volume-8, Issue-8)

A Survey on Source Code Retrieval for Bug Localization Using

Latent Dirichlet Allocation along with Semantic Similarity and

Code Smell Detection

N. Kamaraj

Assistant Professor and Head of B.Sc IT, Sri Ramakrishna Mission Vidyalaya College of Arts and

Science, Coimbatore, Tamil Nadu, India

[email protected]

A.V Ramani

Associate Professor and Head of Computer Science, Sri Ramakrishna Mission Vidyalaya College of Arts and

Science, Coimbatore, Tamil Nadu, India

[email protected]

Abstract: Bug localization is the task of determining the source code which entities are relevant to a bug report. It is an important task of classification in software data set resources. Manual bug localization is labour intensive since developers must consider thousands of source code entities. To builds bug localization classifiers, based on information retrieval models, to locate entities that are textually similar to the bug report. Current research, however, does not consider the effect of classifier configuration. Designer uses data to detect a bug located in the segment of the source code to correct the bug. They are several categories are used, i.e., code smells detection and pattern clustering. To identify the numerous semantic relations existing between two given words, a pattern clustering algorithm has been proposed. This survey proposed research analysis on the bug localization classifiers based on information retrieval models to locate entities that are textually alike to the bug report.

Keywords–Information retrieval, Code smells, Bug localization, Information extraction, Clustering, Classifiers

I. INTRODUCTION

Bug localization, a developer uses information about a bug to locate the portion of the source code to modify and to correct the bug. Some recent static techniques for automating bug localization have been built around modern Information Retrieval (IR) [1] models such as Latent Semantic Indexing (LSI) [2] and its probabilistic extension LSI [3]. IR-based static techniques aim to identify the elements of the software system that need to be modified to correct a bug.

From the standpoint of retrieval from large software libraries for the purpose of bug localization, five generic text models are compared: Unigram Model (UM), Vector Space Model (VSM), Latent Semantic Analysis Model (LSA), Latent Dirichlet Allocation Model (LDA), and Cluster Based Document Model (CBDM) [12].

Measuring the semantic [16] between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering and automatic metadata extraction. The analysis is an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words.

Bug localization is the function of condition which source code operation is relevant to a bug report. Manual bug localization is labor in depth since developers must consider thousands of source code operations. Current research builds bug localization classifiers based on information retrieval models, to locate entities that are textually similar to the bug report [5].

II. METHODOLOGIES

2.1 Source Code Retrieval for Bug Localization

Source code retrieval techniques operate on models of the source code that are constructed from semantic information embedded in the code, including identifiers and comments and serve as the documents that are provided as input to an Information Retrieval (IR) technique [1].

IR- based static techniques aim to identify the elements of the software system that need to be modified to correct a bug. To perform Latent Dirichlet Allocation (LDA)-based bug localization on a given version of a software system, first an LDA model of the source code is built. Then, the created model is queried as often as necessary to localize bugs exiting in that version [4].

(2)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 125-130

study, the analysis of Rhino wills the LDA-based technique [3] perform well across all bugs in the system. The second case study, invokes the application of LDA-based technique to the Eclipse software system, an open source IDE written in Java.

LDA can successfully be activated to source code recover for the purpose of bug localization. Furthermore, the results suggest an LDA-based approach is more effective than approaches using LSI alone for this task. It is planned to examine more bugs in larger software systems, such as Eclipse and Mozilla, to determine how the approach works across a large number of bugs in those systems [1].

2.2 Retrieval from Software Libraries for Bug Localization

While most existing static and dynamic bug localization techniques that may be thought of as bug detection or prognosis tools, an IR-based bug localization technique is best considered as a post diagnosis (debugging) tool [12].

IR-based bug localization uses the VSM model to represent the source code. That contribution used the JEdit software to demonstrate that the patch files that were posted on the bug tracking system were a subset of the files extracted by the IR tool. Use of the Latent Dirichlet Allocation model for source code retrieval for bug localization was explored in [13].

The VSM model is the simplicity of the computations that go into model construction and the ease with which a query can be compared with the documents. Its main disadvantage is the generally large dimensionality and sparseness of the document vectors. The LSA (Latent Semantic Analysis) model attempts to reduce the large space spanned by the vectors of the VSM model.

The Cluster-Based Document Model (CBDM) is another relatively popular approach for text modeling [14]. CBDM is also workable to merge some of the generic models to form what are generally more powerful compound text models. A query is compared with the documents in a model for the purpose of retrieval. Each model has its own way of representing a query.

IR-based bug localization researchers have employed “rank of retrieved files” metric to analyze the performance of their retrieval algorithms [15]. IR-based bug localization tools with those that carry out dynamic bug localization could significantly improve the state of the art in bug localization.

2.3 Web Search Engine-based approach to Semantic Similarity

Danushka Bollegala is proposed an automatic method to estimate the semantic similarity between words or entities using web search engines. Because of the vastly numerous documents and the high growth rate of the web, it is time-consuming to analyses each document separately [16].

Given taxonomy of words, a straightforward method to calculate similarity between two words is to find the length of the shortest path connecting the two words in the taxonomy [17]. A word is polysemous and then multiple paths might exist between the two words. In such cases, only the shortest path between any two senses of the words is review for calculating sameness.

Cilibrasi and Vitanyi proposed a distance metric between words using only page counts retrieved from a web search engine. The snippets get back by a search engine for the conjunctive query of two words supply useful clues related to the semantic relations that exist between two words [18].

Typically, a semantic relation can be expressed using more than one pattern. Identifying the different patterns that express the same semantic relation enables the user or researcher to represent the relation between two words accurately. According to the distributional hypothesis [19], words that occur in the same context have similar meanings. The distributional hypothesis has been used in various related tasks, such as identifying related words, and extracting paraphrases [20].

By sorting the lexical patterns in the descending order of their frequency and clustering the most frequent patterns first, it form clusters for more common relations first. This enables us to separate rare patterns which are likely to be outliers from attaching to otherwise clean clusters [21].

A semantic similarity measure is evaluated using the correlation between the similarity scores produced by it for the word pairs in a benchmark data set and the human ratings. Both Pearson correlation coefficient and Spearman correlation coefficient have been used as evaluation measures in the previous work on semantic similarity. Spearman correlation is more appropriate for evaluating semantic similarity measures, which might not be necessarily linear. In fact, most semantic similarity measures are nonlinear.

2.4 GUI and Web Applications

(3)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 125-130

Researchers have shown that this type of GUI testing finds faults related not only to the GUI and its glue code, but also in the underlying business logic of the application [23]. Present techniques used in implementation to test such GUIs are mostly manual. The most popular tools used to test GUIs are capture/replay tools such as WinRunner that provide very little automation [24], especially for creating test cases.

Web application testing is defined as exercising the entire application code by generating URL based inputs with the intent of finding failures that manifest themselves in output response HTML pages. Testing of Web program code to locate liability in the program is mostly a manual task. Capture-replay tools capture tester interactions with the application and are then replayed on the Web application [25].

GUI and Web systems consistently support changes as part of their maintenance process. New versions of the applications are often created as a result of bug fixes or requirements modification. Web application GUI concern act differently. That is, some widget actions are handled at the client (e.g., in the form of JavaScript code in the browser) [22].

In spite of GUI and Web applications were modeled in previous research. These two classes of applications have many sameness. These similarities to generate the single model for test suite order of both GUI and Web applications.

A test case is modeled as a chain of actions. For each action, a user sets a value for one or more variables. Prioritization function takes as input a set of test cases to be ordered and returns a sequence that is ordered by the prioritization criterion, because they have developed a unified model of GUI and Web applications [23].

One feature of event-driven software and test cases which also belongings for other types of systems is their province on parameters and values for execution. Relation between multiple parameter values make the program follow a decided execution path and are probable to expose faults in the system. This basic premise led to the development of the first set of prioritization criteria, which are based on giving higher priority to test cases with a large number of parameter values interactions.

One way and Two way parameter value interaction coverage techniques, select tests to systematically enclose parameter value interactions between windows. The final set of criteria gives higher priority to test cases that cover windows that are perceived to be important to the EDS from a testing perspective [26].

Frequency based criteria, importance of a window as the number of times the window is accessed in the test cases. Because of the user-basic design of event-driven software, these windows are probable to contain more code functionality (and likely to contain more faults), and thus test cases that cover.[26].

In all of the implementations, in case of a tie between two or more tests that meet the prioritization criterion, a random tie-breaking strategy is implemented using the RANDOM() function. In addition, a greedy optimal ordering (G-Best) uses a greedy step to select the next test case that detects the most yet-undetected faults and repeat this process until all the tests are selected. A greedy worst ordering (G-Worst) criterion, where in each iteration, the algorithm selects the next test case that covers the least uncovered faults. It repeats this until all of the tests are selected. Ordering by Random selects a next test uniformly at random [26].

2.5. Information Retrieval Model on Bug Localization

Information retrieval is the study of querying for text within a collection of documents [4]. IR-based bug localization classifiers use IR models to find textual similarities between a bug reports (i.e., query) and the source code entities (i.e., documents). IR-based classifiers contain several parameters that control their behavior. Specifying a value for all parameters fully defines the configuration of a classifier [5].

The Vector Space Model (VSM) is a simple algebraic model based on the term-document matrix of a corpus [6]. The term-document matrix is an m*n matrix whose rows represent individual terms, i.e., words and columns represent individual documents. Latent Semantic Indexing (LSI) is an extension to VSM in which singular value decomposition (SVD) is used as a means to project the original term-document matrix into three new matrices: a topic-document matrix D, a term topic matrix T, and a diagonal matrix S of eigenvalues [7].

Latent Dirichlet Allocation is a popular statistical topic model that provides a means to automatically index, search, and cluster documents that are unstructured and unlabeled [8].

Stephen W. Thomas and Meiyappan Nagappan proposed a configuration framework for analyzing the various configurations of a classifier. They used categorical and numerical parameters for their configuration. [32].

They analyzed the performance of the configuration with two different goals. One is to determine the best and worst configuration and the other is to determine which parameter values are most effectively [10].

(4)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 125-130

The best individual IR-based classifier uses the Vector Space Model, with the index built using tf-idf term weighting on all available data in the source code entities (i.e., identifiers, comments, and past bug reports for each entity), which has been stopped, stemmed, and split, and queried with all available data in the bug report (i.e., title and description) with cosine similarity [10].

2.6. Code-Smell Detection

Code-smells are also called design anomalies or design defects, refer to design situations that adversely affect the software maintenance. As stated by, bad-smells are unlikely to cause failures directly, but may do it indirectly [27].

The code-smells’ detection process consists of finding code fragments that violate structure or semantic properties such as the ones related to coupling and complexity. In this setting, internal attributes used to define these properties, are captured through software metrics and properties are expressed in terms of valid values for these metrics [28].

Optimization problems are often hard and expensive from a CPU time and/or memory viewpoint [29]. The use of Meta heuristics, such as Evolutionary Algorithms and Particle Swarm Optimization (PSO), allows reducing the computational complexity of the search process. Evolutionary algorithm is based on

Genetic Programming (GP) to generate code-smells detection rules and also executed in parallel is Genetic Algorithm (GA) that generates detectors (code smells examples) from well-designed code examples [30].

GP evaluates detection rules from code-smell examples (input) and GA evaluates in parallel to generate detectors from well-designed code (input) using global and local alignment techniques [31]. In the initialization of the Parallel Evolutionary Algorithm (P-EA), one system from the base of examples is selected randomly to calculate the intersection function [33].

GP studies that have newly intense on detecting code-smells in software using different techniques. These techniques range from fully automatic detection to guided manual inspection.

There are seven broad categories: manual approaches, symptom-based approaches, rule-based approaches, probabilistic approaches, visualization-based approaches, search-based approaches and cooperative based approaches [34].

III. ANALYSIS

Table I : Analysis of various literature surveys as shown in the below table.

S.No Author Year Techniques Methods Findings

1 Stacy K. Lukins, Nicholas A. Kraft,

Letha H.Etzkorn

2008 Latent Semantic Indexing (LSI)

Information retrieval

Improve the LDA

2 Shivani Rao, Avinash Kak 2010 Vector Space

Model (VSM)

Bug localization Rank of Retrieval

files

3 Danushka Bollegala, Yutaka

Matsuo , Mitsuru Ishizuka

2011 Lexical Pattern Extraction

Web Text analysis

Information Extraction

4 Renee C Bryce, Sreedevi Sampath,

Atif M Memon

2011 Event Driven Software (EDS)

Web Application Testing

Validity of test suites for web applications

5 Stephen W Thomas, Dorothea

Blostein, Ahmed E Hassan

2013 Singular Value Configuration

Framework

Classify the bugs

Decomposition

(SVD)

6 Wael Kessentini, Marouane, Houari

Sahraoui, Slim Bechikh

2014 Code smells detection

Optimization process

New search based approach

Comparison of various models - Literature Survey

IV. CONCLUSION

Resolving the bug localization problem has major implications for programmers or developers, because it can reduce the time and effort required to maintain or mange software. Classifier combination helps in almost all cases. No matter the underlying classifiers are used or the specific combination technique is used.

(5)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 125-130

Future work for Lexical Pattern Extraction Algorithm has been used to extract numerous semantic relations existing between two words. Moreover, a Sequential Pattern Clustering Algorithm is to identify different Lexical Patterns that describe the same semantic relation. Both the page counts-based co-occurrence measures and Lexical Pattern Clusters would be used to define features for a word pair.

REFERENCES

[1] Stacy K. Lukins, Nicholas A. Kraft, Letha H.Etzkorn, “Source Code Retrieval for Bug Localization using Latent

Dirichlet Allocation”, 15th Working Conference on Reverse Engineering, IEEE computer society, 2008, pp. 155-164

[2] D.Poshyvanyk, Y.G. Guéhéneuc, A. Marcus, G.Antoniol and V.Rajlich, “Combining Probabilistic Ranking and

Latent Semantic Indexing for Feature Location”, In Proc. 14th IEEE Int. Conf. on Program Comprehension, Athens, Greece, June 2006, pp. 137-148.

[3] T.Hofmann, “Probabilistic Latent Semantic Indexing”, In Proc. 22nd Annu. ACM SIGIR Int. Conf. on Research

and Development in Information Retrieval, Berkeley, CA, USA, August 1999, pp.50-57

[4] C.D.Manning, P.Raghavan, and H.Schutze, Introduction to Information Retrieval, vol. 1, Cambridge Univ. Press

Cambridge, 2008.

[5] Stephen W. Thomas, Meiyappan Nagappan, Dorothea Blostein, Ahmed E. Hassan, “ The impact of Classifier

Configuration and Classifier Combination on Bug Localization”, IEEE Transactions on Software Engineering ,Vol.39, No.10, October 2013, pp.1427-1443

[6] G.Salton, A.Wong and C.S.Yang,“A Vector Space Model for Automatic Indexing,” Comm. ACM,

vol.18,no.11,pp.613-620, 1975.

[7] S.Deerwester, S.T.Dumais, G.W. Furnas, T.K. Landauer, &R.Harshman, “Indexing by Latent Semantic

Analysis,” J. Am. Soc. Information Science, vol. 41, no. 6, pp. 391-407, 1990.

[8] D.M.Blei, A.Y.Ng, and M.I. Jordan, “Latent Dirichlet Allocation,” J. Machine Learning Research, vol. 3, pp.

993-1022, 2003

[9] D.M.Blei & J.D.Lafferty,“Topic Models,” Text Mining: Classification, Clustering

andApplications,pp.71-4.Chapman&Hall, 2009.

[10] M.D’Ambros, M. Lanza, and R. Robbes, “Evaluating Defect Prediction Approaches: A Benchmark and an

Extensive Comparison,” Empirical Software Eng., vol. 17, no. 4, pp. 531-577, 2012

[11] A.T.Nguyen, T.T. Nguyen, J. Al-Kofahi, H.V. Nguyen, and T.N. Nguyen, “A Topic-Based Approach for

Narrowing the Search Space of Buggy Files from a Bug Report,” Proc. 26th Int’l Conf.Automated Software Eng., pp. 263-272, 2011.

[12] Shivani Rao, Avinash Kak , “ Retrieval from Software Libraries for Bug localization: A Comparative Study of

Generic and Composite Text Models”, Waikiki, Honolulu, HI, USA, May 2011, pp. 43-52.

[13] S.K.Lukins, N.A.Karft E.H.Letha. Source Code Retrieval for Bug Localization using Latent Dirichlet

Allocation. In 15th Working Conference on Reverse Engineering, 2008.

[14] X.Liu and W.B.Croft. Cluster-Based Retrieval using Language Models.Proceedings of the 27th annual

international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’04, pages 186–193, New York,NY,USA, 2004,ACM

[15] B.Cleary, C. Exton, J. Buckley, and M. English. An Empirical Analysis of Information Retrieval based Concept

Location Techniques in Software Comprehension. Empirical Softw. Engg., 14(1):93–130, 2009.

[16] Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, “ A Web Search Engine – Based Approach to Measure

Semantic Similarity between Words”, IEEE Transactions on Knowledge and Data Engineering, Vol 23, No. X, 2011, pp. 01-14.

[17] R.Rada, H.Mili, E.Bichnell, and M. Blettner, “Development and Application of a Metric on Semantic Nets,”

IEEE Trans. Systems, Man and Cybernetics, vol. 19, no. 1, pp. 17-30, Jan./Feb. 1989.

[18] R.Cilibrasi and P.Vitanyi, “The Google Similarity Distance,” IEEE Trans. Knowledge and Data Eng., vol. 19,

no. 3, pp. 370-383, Mar. 2007.

[19] Z. Harris, “Distributional Structure,” Word, vol. 10, pp. 146-162, 1954.

[20] D. Lin, “Automatic Retrieval and Clustering of Similar Words,” Proc. 17th Int’l Conf. Computational

Linguistics (COLING), pp. 768- 774, 1998.

[21] R.Bhagat and D.Ravichandran, “Large Scale Acquisition of Paraphrases for Learning Surface Patterns,” Proc.

Assoc. for Computational Linguistics: Human Language Technologies (ACL ’08: HLT), pp. 674-682, 2008.

[22] Renee C. Bryce, Sreedevi Sampath, Atif M.Memon, “ Developing a Single Model and Test Prioritization

(6)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 125-130

[23] P. Brooks, B. Robinson, and A.M. Memon, “An Initial Characterization of Industrial Graphical User Interface

Systems,” Proc. IEEE Int’l Conf. Software Testing, Verification, and Validation, pp. 11-20, 2009.

[24] A.M. Memon and Q.Xie,“Studying the Fault-Detection Effectiveness of GUI Test Cases for Rapidly Evolving

Software,” IEEE Trans. Software Eng.,vol.31, no. 10, pp. 884-896, Oct. 2005.

[25] “WebSite Test Tools and Site ManagementTools,”http://www.softwareeqatest .com/ qatweb1.html, Apr.2009

[26] K. Onoma, W.-T. Tsai, M. Poonawala, and H. Suganuma,“Regression Testing in an Industrial Environment,”

Comm. ACM, vol. 41, no. 5, pp. 81-86, May 1988.

[27] Wael Kessentini, Marouane Kessentini, Houari Sahraoui, Slim Bechikh, Ali Ouni, “A Cooperative Parallel

Search –Based Software Engineering Approach for Code–Smells Detection”, IEEE Transaction on Software engineering, Vol 40, No.9, September 2014, pp.841-861

[28] S. R. Chidamber and C. F. Kemerer, “A metrics suite for object-oriented design,” IEEE Trans. Softw. Eng., vol.

20, no. 6, pp. 293–318, Jun. 1994.

[29] P. Siarry and Z. Michalewicz, Advances in Metaheuristics for Hard Optimization (Natural Computing Series).

New York, NY, USA: Springer, 2008.

[30] W. Banzhaf, “Genotype-phenotype-mapping and neutral variation: A case study in genetic programming,” in

Proc. Int. Conf. Parallel Problem Solving from Nature, 1994, pp. 322–332.

[31] D.E.Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA,USA:

Addison Wesley, 1989.

[32] Stephen.W. Thomas, “Mining Software Repositories with Topic Models”, Technical Report 2012-586, School

of Computing, Queen’s University, 2012.

[33] W.H. Kruskal and W.A Wallis, “Use of ranks in one criterion variance analysis”, J.Amer. Statist. Assoc., vol.

47, no. 260,pp. 583-621.

[34] G.Langelier, H.A.Sahraoui, P.Poulin, “Visualization based analysis of quality for large scale software systems”,