A dependency-aware, context-independent code search infrastructure

Inaugural dissertation

for the attainment of the academic degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften) at the Universität Mannheim

submitted by Diplom-Informatiker Marcus Schumacher from Heidelberg

First examiner (Referent): Prof. Dr. Colin Atkinson, Universität Mannheim

Second examiner (Korreferent): Prof. Dr. Ralf H. Reussner, Karlsruher Institut für Technologie (KIT)

Abstract

Over the last decade many code search engines and recommendation systems have been developed, both in academia and industry, to try to improve the component discovery step in the software reuse process. Key examples include Krugle, Koders, Portfolio, Merobase, Sourcerer, Strathcona and SENTRE. However, the recall and precision of this current generation of code search tools are limited by their inability to cope effectively with the structural dependencies between code units. This lack of “dependency awareness” manifests itself in three main ways. First, it limits the kinds of search queries that users can define and thus the precision and local recall of dependency aware searches (giving rise to large numbers of false positives and false negatives). Second, it reduces the global recall of the component harvesting process by limiting the range of dependency-containing software components that can be used to populate the search repository. Third, it significantly reduces the performance of the retrieval process for dependency-aware searches.

This thesis lays the foundation for a new generation of dependency-aware code search engines that addresses these problems by designing and prototyping a new kind of software search platform. Inspired by the Merobase code search engine, this platform contains three main innovations: an enhanced, dependency-aware query language which allows traditional Merobase interface-based searches to be extended with dependency requirements, a new “context-independent” crawling infrastructure which can recognize dependencies between code units even when their context (e.g. project) is unknown, and a new graph-based database integrated with a full-text search engine and optimized to store code modules and their dependencies efficiently. After describing the background to, and state-of-the-art in, the field of code search engines and information retrieval, the thesis motivates the aforementioned innovations and explains how they are realized in the DAISI (Dependency-Aware, context-Independent code Search Infrastructure) prototype using Lucene and Neo4J. DAISI is then used to demonstrate the advantages of the developed technology in a range of examples.

Summary (Zusammenfassung)

Over the last decade, numerous code search engines and so-called recommendation systems have been developed in both academia and industry to improve the first step in the process of reusing software components: the search for suitable components. Notable examples from recent years are, or were, Krugle, Koders, Portfolio, Merobase, Sourcerer, Strathcona and SENTRE. However, the recall and precision of this current generation of code search programs are limited, because they are only partially able to handle structural dependencies between different code units effectively. This lack of “dependency awareness” shows itself in three main aspects. First, the way in which users can define search queries is restricted, and with it the precision and local recall of dependency-aware searches (which leads to a large number of false-positive and false-negative results). Second, the global recall of the component harvesting process is reduced, because the ability to represent software components that contain dependencies in the search repository underlying the search is limited. Third, the efficiency of the information retrieval process for dependency-related searches is significantly reduced.

This thesis lays the foundation for a new generation of dependency-aware code search engines, in which these problems and limitations are solved by designing and prototyping a new kind of search platform for software. Inspired by the Merobase code search engine, three main innovations are presented: an extended, dependency-aware query language with which conventional Merobase-based search queries can be extended with dependency requirements; a new “context-independent” crawling infrastructure that can recognize dependencies between code units; and the integration of a graph database with a full-text search engine, optimized for the efficient storage of code modules and their dependencies. After presenting the foundations and the state of the art in the field of code search engines and information retrieval, the dissertation motivates the aforementioned innovations and explains how they are realized within DAISI (Dependency-Aware, Context-Independent code Search Infrastructure) on the basis of Lucene and Neo4J. DAISI is also used to demonstrate the advantages of the developed technology in a series of examples.


Acknowledgment

After many years of intensive work, this Acknowledgment represents the culmination of my dissertation project. The work was intensive but also rewarding, not only due to the many things I was able to learn exploring my research topic but also due to the many interesting people I was able to work and interact with. It was always a pleasure being at the chair, and over the years many of my colleagues have become friends.

First and foremost I would like to thank my supervisor, Colin Atkinson, for giving me the chance to research on this stimulating topic and for always having an open ear for questions and productive discussions. The atmosphere you have created at your chair helped me in many ways, and provided the freedom needed for creative thinking. Without this, many of my colleagues and I would not have been able to come up with such innovative solutions to the problems we tackled.

Second, I would like to thank all the colleagues who have accompanied me all these years, especially Oliver Erlenkämper, Werner Janjic, Ralph Gerbig, Thomas Schulze and Marcus Kessel. Thank you for your encouragement and all the discussions in the coffee corner where a lot of ideas were born (about all manner of things including our research topics).


supported and encouraged me all these years, and my sons Fabian and Florian, who sometimes had to put up with playtime interruptions when their father had to check on his research prototype or implement a new idea. It has been a struggle to get this dissertation finished, but sometimes the best research ideas came up when building turrets or houses. I therefore dedicate this work to my family.

5 Dependency-aware Metamodel
5.1 The Core Metamodel
5.1.1 Extended Metamodel for Java
5.2 Infrastructure of the Graph
5.3 Text Document Storage

6 Environment-Independent Harvesting
6.1 Crawling and Parsing
6.1.1 Context-Independent Content Analysis
6.1.2 Handling the source code

7 Dependency-Aware Searches
7.1 DAQL
7.2 Search Types
7.3 Classification using Graph IR Methods

8 Diagrammatic Query Definition
8.1 The Search Event
8.1.1 Reuse Scenarios
8.2 UML-based Search
8.3 Search User Interface
8.4 Drag and Drop Search

9 Evaluation
9.1 Simple Case
9.1.1 Case 1
9.1.2 Case 2
9.2 Methods from Superclasses
9.3 Complex Scenario
9.4 Hypothesis Validity

10 Conclusion
10.1 Weaknesses

Books


List of Figures

1.1 Simple CustomerManagement System
2.1 IR Model
2.2 Boolean Retrieval Model Query conjunctive components
2.3 Undirected and directed graph
3.1 Sourcerer relational metamodel
3.2 Merobase relational metamodel
3.3 A simple example representing classes in a graph
4.1 Query comparison between standard search engines and Exemplar
4.2 Architecture of the Sourcerer infrastructure
5.1 Core Metamodel
5.2 Java Specific Metamodel
5.3 Example Application of the Core Metamodel
5.4 CandidateCollection example
6.1 Crawling and parsing process of the DAISI search engine
6.2 Process of the analyser and the different file formats
6.3 MOF pyramid defined by the OMG
6.4 Process of transforming Source Code to a KDM to a UML Model
6.5 Root elements of the ASTM metamodel
6.6 CandidateCollection example
8.1 Design -> Implementation -> Validation Process
8.2 KobrA representation of the CustomerManagement example
8.3 Search result list
8.4 Details of a component
8.5 UI of the “drag and drop” search possibility
9.1 Stack - Item simple example
9.2 Berrypicking search process
9.3 Stack - Item example
9.4 Three Stack classes connected to the same Item class


List of Tables

4.1 Fields of the Merobase index
4.2 Lucene field for the updateCustomer method
4.3 Counter for same parameter signature
5.1 Properties of the individual class nodes in the graph
5.2 Properties of the CodeMethod
5.3 Index structure for CodeMethods
5.4 Index structure for CodeClasses
6.1 MOF relationship of KDM and ASTM
6.2 MOF relationship of KDM and ASTM
6.3 Java specific SASTM Elements
6.4 Removed terms of the source code for keyword based search
7.1 Prefixes for the different search capabilities
7.2 Relation types within the query
9.1 P@5 and P@10 metric of the simple Stack search
9.2 P@5 and P@10 measurement values for the Stack search with Item
9.3 precision, recall and F-measurement values
9.4 P@5 and P@10 values for the Customer search with methods
9.5 P@5 and P@10 measurement values of the Customer search


1. Introduction

Google can bring you back 100,000

answers. A librarian can bring you

back the right one.

– Neil Gaiman –

Today software permeates almost every part of our lives and environment, whether it be as programs within computers, applications on smartphones, embedded controllers within consumer goods or artificial intelligence within autonomously driving cars. Software lies at the heart of all modern, “smart” products. However, the development of software is still a tremendously costly process. Although there have been many changes in software engineering approaches over the last 50-60 years, with waterfall processes gradually giving way to agile processes [Som01], software development still primarily revolves around the notion of writing code from scratch.

The idea of systematically building new applications from pre-existing components was first promoted in the 1960s to increase software quality and raise productivity [McI68] [Moh+04]. However, today reuse is essentially only practised in ad-hoc, opportunistic ways by individual developers who happen to be aware of existing software that could fulfil their needs [LM89] [HW07]. Large-scale, systematic reuse that mirrors the component-based assembly found in other industries is still a long way off in software engineering. In the automotive industry, for example, almost all new cars have standard, reusable components installed (e.g. radio, air conditioning, fuel pumps etc.). The benefits of prefabricated component assembly in software engineering are potentially just as dramatic. For example, Lim found that the defect density in software systems built from existing components was half that of systems developed from scratch through normal processes [Lim94].

However, systematically supporting software reuse is difficult because there are so many different forms of components and ways of reusing them. A “component” can be as small as a block of code or as large as a complete subsystem or framework, and component reuse can take many different forms, depending on how many components are involved and the reuser’s degree of knowledge about a component’s internal realization. In terms of quantity, a developer may sometimes only wish to reuse one independent component and on other occasions a developer may wish to reuse several interconnected components. In terms of knowledge, a developer may sometimes want to reuse a component “as-is” in a black box way, without any knowledge about how it works internally, and on other occasions a developer may wish to reuse a component in a white box way by modifying it for the task in hand or simply learning from the way it is implemented [Sim+11].

Those differences have an impact on all phases of the reuse process. However, they present by far the biggest challenge for the first and arguably most important step, which is to find suitable reuse candidates in the first place. Probably the most effective way of boosting software reuse in software engineering is to provide easier and more reliable ways for developers to find components, covering all the different kinds of components and reuse forms. Over the last few years there have been many attempts to improve the component discovery step in the software reuse process, including the development of code search engines like Krugle [Kru13], Koders, Portfolio [McM+11], Merobase [Jan+13] or Searchco.de and plug-in recommendation tools such as Strathcona [HM05] or SENTRE [Jan14]. However, the current generation of software search engines or recommendation systems only supports a few of the aforementioned component types and reuse scenarios. In fact, most developers still use text-based searches on Google, or similar web search engines, to find components to use “as-is” or to find reference examples they can use in their own applications [Sim+11]. Many important cases are only supported in a very rudimentary way or not at all. As the quote by Gaiman at the beginning of this chapter hints (“Google can bring you back 100,000 answers. A librarian can bring you back the right one.”), today’s software search engines can return many results to almost every query, but they are frequently not the right ones. To improve the precision of software search engines it is necessary to make them aware of the structure of software components as well as of the “text” elements they contain.

1.1 Dependency Awareness

As an example of the kind of challenges faced by developers when trying to use today’s code search engines to support reuse, suppose a developer is responsible for building a customer management (sub-)system based on the core classes and relationships shown in figure 1.1. The central class in this system is the “CustomerManagement” class, which is the entry point for looking up customers, updating customers and creating new ones. This class is therefore responsible for managing multiple instances of the class “Customer”. At the beginning of the development process, the detailed properties of these classes are largely unknown, but it is clear that the Customer class must at least contain attributes to store customer names and addresses.

[Figure: UML class diagram with two related classes. CustomerManagement: + getCustomer(String) : Customer; + addCustomer(Customer); + updateCustomer(Customer). Customer: - name : String; - forename : String; - address : Address; + getName() : String; + getAddress() : Address.]

Figure 1.1: Simple CustomerManagement System

A developer who is open to reusing existing code to build this subsystem might start by entering a query of the form “java customermanagement customer getname” in a general purpose search engine such as Google. As shown, for example, by Sim et al. [Sim+11], most developers turn to such search engines to find code. At the time of writing, entering this query into Google returned only one remotely relevant result, at position two: a single Java class called CustomerManagement hosted at GitHub. The class Customer is completely missing in this result. Entering this query into Koders or Krugle, the two most popular and well known code search engines, returned no results at all. The problem is the sensitivity of these search engines to the exact keywords supplied in the query. If the identifiers in a component stored in the repository deviate in the slightest way from the keywords in the query, these search engines are unable to detect a match.

To provide better results, Krugle and Koders both provide special prefixes to identify what role specific keywords should play in software. For example, it is possible to obtain relevant results for this scenario from Krugle by changing the query to “customermanagement functiondef:getCustomer”, and in the case of Koders a much more effective query is “cdef:customermanagement mdef:getCustomers”. This specifies that, to be included in the result set, a class must either have the name CustomerManagement or a method called getCustomers, or both. The inability of the first example query to return any results from Koders and Krugle suggests they do not have any classes called CustomerManagement in their index. Therefore, in the case of Krugle it was also necessary to change the individual query to “customer functiondef:getCustomer”, where the class name was changed from customermanagement to customer, and in the case of Koders to “cdef:customerservice mdef:getCustomers” to search for a customerservice instead of the customermanagement.

Of course, developers are usually interested in the functionality offered by components, not their exact names. However, unlike Google, major code search engines like Koders and Krugle are unable to take similar names into account when searching for components. Additionally, in both cases the given query was unable to convey the requirement for the getName method. General purpose search engines like Google provide no special prefixes for code and are thus unable to recognize the types or names of any input or output parameters of methods. Google simply matches keywords in the query to identifiers in the source code, regardless of what they mean, but applies sophisticated name similarity detection techniques. In summary, Google almost always returns a high number of results, but the vast majority of them are irrelevant (low precision), while mainstream code search engines such as Krugle and Koders often return few if any results (low recall).

The new generation of academic search engines developed over the last couple of decades, such as Koders, Portfolio, Merobase or Searchco.de, has tried to address this problem by introducing new ways of formulating queries. For example, Merobase introduced interface-based searches in which queries essentially describe the signature information wrapped up in interface specifications. For example, the query

Customer(getName():String; getId():String; setName(String))

will search for components (e.g. classes) called Customer offering the methods getName and getId, which return a String value, and setName, which accepts a String parameter. This significantly increases the precision of the results compared to the aforementioned search engines, which simply look for the union of the listed features. However, the results are still very sensitive to the identifiers chosen in the query. Merobase therefore offers a variant of interface-based search, called signature-based search, which only specifies the signatures of the required methods, without specifying the identifiers. This, of course, has the effect of increasing the recall again, but at the expense of precision. However, while these more advanced query specification capabilities allow precision and recall to be improved, especially when combined with systematic query reformulation techniques, they still focus on individual components.
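The trade-off between interface-based and signature-based matching can be illustrated with a small sketch. It is not Merobase's implementation: the parser only handles the simple query form used in the example, a missing return type is assumed to mean `void`, and the names `parse_query`, `interface_match`, `signature_match` and the two candidate classes are hypothetical.

```python
import re

def parse_query(query):
    """Parse 'Name(m1(P1,P2):R; m2():R)' into (class_name, [(method, params, ret)]).

    Hypothetical helper covering only the simple query form from the text.
    """
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", query)
    if match is None:
        raise ValueError("unsupported query: " + query)
    class_name, body = match.groups()
    methods = []
    for part in filter(None, (p.strip() for p in body.split(";"))):
        m = re.fullmatch(r"(\w+)\(([^)]*)\)(?::\s*(\w+))?", part)
        if m is None:
            raise ValueError("unsupported method: " + part)
        name, params, ret = m.groups()
        param_types = tuple(p.strip() for p in params.split(",") if p.strip())
        # Assumption: a method without a declared return type returns void.
        methods.append((name, param_types, ret or "void"))
    return class_name, methods

def interface_match(query, candidate):
    """Class name and every method (name plus signature) must match, ignoring case."""
    class_name, methods = parse_query(query)
    if candidate["name"].lower() != class_name.lower():
        return False
    offered = {(n.lower(), p, r) for n, p, r in candidate["methods"]}
    return all((n.lower(), p, r) in offered for n, p, r in methods)

def signature_match(query, candidate):
    """Only parameter and return types must match; identifiers are ignored."""
    _, methods = parse_query(query)
    offered = [(p, r) for _, p, r in candidate["methods"]]
    for _, params, ret in methods:
        if (params, ret) not in offered:
            return False
        offered.remove((params, ret))  # one offered method satisfies one requirement
    return True

QUERY = "Customer(getName():String; getId():String; setName(String))"

customer = {"name": "Customer", "methods": [
    ("getName", (), "String"), ("getId", (), "String"), ("setName", ("String",), "void")]}
client = {"name": "Client", "methods": [
    ("getFullName", (), "String"), ("getKey", (), "String"), ("rename", ("String",), "void")]}

print(interface_match(QUERY, customer), interface_match(QUERY, client))
print(signature_match(QUERY, customer), signature_match(QUERY, client))
```

In this toy setting the differently named `Client` class is rejected by the interface-based match but accepted by the signature-based match, which is the recall gain (and precision risk) described above.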

The basic problem with existing code search engines is their lack of “awareness” of the typical kinds of dependencies that exist between code elements. This is a fundamental problem because in the object-oriented programming languages primarily used to write software today, functionality is almost always distributed over several classes. Making search engines, and the queries that drive them, “dependency aware” would allow them to consider the content of several related classes rather than one individual class when establishing their relevance for the user.

In the case of the Customer Management example, not only would it be possible to search for a CustomerManagement class which has some kind of relationship to a Customer class, it would also be possible to specify the precise kind of relationship and the properties of the Customer class, like name or address attributes. Additionally, the immediate environment of each class could be examined so that methods that are not contained by the class itself, but are defined in other classes (e.g. superclasses), can be taken into account. Dependency-aware search technology therefore has the potential to further enhance the precision and recall of searches. It can enhance the former by allowing users to specify precisely what elements they expect components to be composed of and what relationships should exist between them. It can enhance the latter by allowing components to be considered as candidates that do not provide all of the desired functionality themselves but do so with the help of other dependent components.
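The idea of a dependency-aware search can be sketched over a tiny in-memory graph. This is only an illustration under stated assumptions: the relation names `EXTENDS` and `USES`, the helper functions and the sample classes (including the `Person` superclass) are hypothetical and do not reproduce the metamodel or query language developed later in this thesis. Inheritance is assumed to be acyclic.

```python
# Hypothetical dependency graph: class name -> declared methods and outgoing edges.
GRAPH = {
    "Person": {"methods": {"getName", "getAddress"}, "edges": []},
    "Customer": {"methods": {"getId"}, "edges": [("EXTENDS", "Person")]},
    "CustomerManagement": {"methods": {"getCustomer", "addCustomer"},
                           "edges": [("USES", "Customer")]},
    "CustomerManagementStub": {"methods": {"getCustomer", "addCustomer"}, "edges": []},
}

def all_methods(graph, name):
    """Methods declared by a class plus those inherited via EXTENDS edges."""
    methods = set(graph[name]["methods"])
    for kind, target in graph[name]["edges"]:
        if kind == "EXTENDS":
            methods |= all_methods(graph, target)
    return methods

def find(graph, source_methods, relation, target_methods):
    """Return (source, target) pairs where source --relation--> target and both
    classes offer the required methods, inherited methods included."""
    hits = []
    for name, node in graph.items():
        if not source_methods <= all_methods(graph, name):
            continue
        for kind, target in node["edges"]:
            if kind == relation and target_methods <= all_methods(graph, target):
                hits.append((name, target))
    return hits

result = find(GRAPH, {"getCustomer"}, "USES", {"getName", "getAddress"})
print(result)
```

Here `Customer` qualifies only because `getName` and `getAddress` are found in its superclass, and `CustomerManagementStub` is excluded because it has no relationship to a `Customer` class, even though its own methods match.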

The first step towards exploiting dependency awareness in code searches was made by the search engine Portfolio [McM+11] which considers the process flow of method calls in related classes, and SENTRE, which loads all dependencies after a component is selected [Jan14]. However, they exploit dependency awareness in only very limited and narrowly focused ways. At the present time there is no code search engine which allows the full range of dependencies supported by modern object-oriented programming languages to be explicitly considered in the formulation and execution of code searches.

1.2 Context-Independent Harvesting

As soon as the focus of searches extends beyond individual code elements to collections of elements, the question of how these elements are correlated and harvested becomes a major one. More specifically, the problem of how to determine whether elements belong together has to be addressed. The simplest and most straightforward approach, where possible, is to harvest all elements from a given project or package (e.g. jar file), because they are related to each other in a given “context” and were usually written by the same development team. The most common way of harvesting elements and their dependency relationships is to collect them from online project repositories such as GitHub or SourceForge, whose projects are typically configured by dependency management tools like Maven, Ivy or Gradle. These tools capture dependencies in configuration files, which are stored along with all the inter-related code elements within projects. However, such configuration files can get lost or be corrupted, or for some reason another code element may become unavailable (e.g. due to versioning problems). This makes it impossible for crawlers that rely on full context information to harvest such code. Also, a great deal of reusable code is available as individual elements and code snippets in internet forums like Stack Overflow. Moreover, these code snippets are often of very high quality [SH13a] because they address concrete problems that are discussed and explained by a large number of developers. A context-independent dependency resolution mechanism which can inter-connect elements harvested from different contexts would therefore significantly enhance a dependency-aware search engine by allowing it to (a) harvest more interrelated code elements (with an awareness of the dependencies between them), and (b) be more robust against events that break the integrity of the index, such as the movement of a project to another server.

Since context-dependent harvesting is much easier than context-independent harvesting, we assume that any search engine that does the latter will also do the former. In other words, we assume that context-independent harvesting subsumes context-dependent harvesting. Conceptually, if the repository over which searches are performed is regarded as complete (i.e. as consisting of all existing software components that can be found on the Internet, rather than the subset collected by a search engine’s crawler), a context-independent harvesting mechanism essentially increases the global recall of a search engine, since it allows more component candidates to be considered.

The benefits of such context-independent harvesting can be directly seen in the example above. To reuse the result returned by Google from the GitHub project, which consists of only one class, CustomerManagement, a developer either has to implement the second class, Customer, themselves or has to perform a second search. Even if the implementation of the Customer class may not be that difficult, a context-independent harvesting process would be able, in the case of a complete index, to find another Customer class containing the necessary structure and information and relate it to the CustomerManagement class.
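How elements harvested from different contexts might be related to one another can be sketched as follows. This is a deliberately simplified illustration, not the two-phase process developed in this thesis: the unit records, the `origin` labels and the rule "prefer candidates from the same origin, otherwise accept any origin" are assumptions made for the example, and the set of required methods is assumed to have been derived elsewhere (e.g. from the calls made on the referenced type).

```python
from collections import defaultdict

# Hypothetical harvested units; "origin" records where the crawler found the code.
UNITS = [
    {"id": 1, "origin": "github:project-a", "name": "CustomerManagement",
     "methods": {"getCustomer"}, "references": {"Customer"}},
    {"id": 2, "origin": "forum:snippet-17", "name": "Customer",
     "methods": {"getName", "getAddress"}, "references": set()},
    {"id": 3, "origin": "github:project-b", "name": "Customer",
     "methods": {"getId"}, "references": set()},
]

def build_name_index(units):
    """Index all harvested units by their simple name, regardless of origin."""
    index = defaultdict(list)
    for unit in units:
        index[unit["name"]].append(unit)
    return index

def resolve(unit, index, required_methods=frozenset()):
    """Map each referenced type name to the ids of candidate units.

    Candidates from the same origin are preferred; if there are none,
    candidates from any other origin are accepted.
    """
    resolved = {}
    for type_name in sorted(unit["references"]):
        candidates = [c for c in index.get(type_name, [])
                      if required_methods <= c["methods"]]
        same_origin = [c for c in candidates if c["origin"] == unit["origin"]]
        resolved[type_name] = [c["id"] for c in (same_origin or candidates)]
    return resolved

INDEX = build_name_index(UNITS)
print(resolve(UNITS[0], INDEX))
print(resolve(UNITS[0], INDEX, required_methods={"getName"}))
```

A purely context-dependent crawler would leave the `Customer` reference of unit 1 unresolved, because no `Customer` class exists in `github:project-a`; the sketch instead links it to candidates found elsewhere and narrows them down by the members the referencing code needs.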

1.3 Research Goals

To address the aforementioned issues this thesis presents the foundations for a new generation of “dependency-aware” search engines which can be populated through “context-independent” harvesting, and demonstrates their effectiveness by means of a prototype implementation. In addition, the new technologies created for this purpose were developed to fulfil the following sub-goals:

1. seamless extension: seamlessly extend the capabilities of the current generation of code search engines.

2. language independence: ensure compatibility with all mainstream object-oriented programming languages and software repositories.

3. scalability: provide reasonable performance (i.e. response time) even for very large scale repositories.

The general hypothesis behind this work is, therefore, that this new search engine technology, which we refer to as DAISI (Dependency-Aware, context-Independent code Search Infrastructure), will increase the precision and recall of code search engines, both in terms of “local recall” relative to the harvested code repository, and in terms of “global recall” relative to the components conceptually available on the Internet. The particular search engine chosen as the baseline for the research was the Merobase search engine, which was also developed at the Chair of Software Engineering at the University of Mannheim. Wherever possible, the new capabilities were added to the existing features of Merobase (e.g. the query language and document indexing structure) in as natural and simple a way as possible. The particular language chosen as the focus of the prototype implementation was the Java programming language.


When it comes to storing information about relationships, the text-based document indexing approaches used to implement the current generation of code search engines are not scalable for a stand-alone solution. Relational databases are also unsuitable to cover the requirements of solid relationship-based dependency discovery because of the large number of “joins” that are often required to resolve relationships [HAS13]. This thesis therefore explores the novel approach of using a combination of traditional text-based indexes and a graph database to store the searchable content.
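The division of labour between a text index and a graph store can be sketched with plain Python data structures. The sketch uses no Lucene or Neo4J API; the inverted index and adjacency list merely stand in for them, and all node contents, relation names and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical searchable content: node id -> indexed text, plus typed edges.
NODES = {
    1: "CustomerManagement getCustomer addCustomer updateCustomer",
    2: "Customer getName getAddress",
    3: "Address getStreet getCity",
    4: "CustomerReport print",
}
EDGES = [(1, "USES", 2), (2, "HAS_ATTRIBUTE", 3)]

def build_text_index(nodes):
    """Inverted index (stand-in for a full-text engine): term -> node ids."""
    index = defaultdict(set)
    for node_id, text in nodes.items():
        for term in text.lower().split():
            index[term].add(node_id)
    return index

def build_adjacency(edges):
    """Adjacency list (stand-in for a graph database): node -> outgoing edges."""
    adjacency = defaultdict(list)
    for source, kind, target in edges:
        adjacency[source].append((kind, target))
    return adjacency

def search(keywords, relation, text_index, adjacency):
    """Nodes containing all keywords that also have an outgoing `relation` edge."""
    hits = None
    for keyword in keywords:
        matches = text_index.get(keyword.lower(), set())
        hits = matches if hits is None else hits & matches
    return sorted(n for n in (hits or set())
                  if any(kind == relation for kind, _ in adjacency.get(n, [])))

TEXT_INDEX = build_text_index(NODES)
ADJACENCY = build_adjacency(EDGES)
print(search(["getCustomer"], "USES", TEXT_INDEX, ADJACENCY))
```

The keyword lookup narrows the candidate set cheaply, and each relationship check is then a direct lookup of a node's neighbours rather than a join over whole tables, which is the property that motivates the combined approach.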

In order to achieve the second sub-goal of being language independent, the document schemata and graph database nodes need to be correlated and language agnostic (i.e. able to cope with all major object-oriented language concepts). A major contribution of the work presented in this thesis is therefore the definition of a language-independent metamodel which captures the key information needed to support dependency-awareness in an efficient and level-agnostic way.

Finally, in order to support context-independent harvesting it was necessary to design a two-phase parsing process, and a generic parsing approach which on the one hand supports the abstract dependency representation approach defined in the language-agnostic metamodel and on the other hand supports the straightforward implementation of language specific parsers and analysers.


1.3.1 Hypotheses

The research goals described in the previous section essentially postulate the validity of the following concrete hypotheses:

Hypothesis 1

It is possible to build a scalable, context-independent, dependency-aware, code search engine populated through context-independent harvesting.

The premise behind this hypothesis is that it is possible to build a code search engine that demonstrates the desired properties of language agnosticism and dependency awareness and that is “practical” in the sense that it is (a) able to harvest and index a “meaningful” collection of components (in a context-independent way) and (b) able to return results to dependency-aware searches over a “large” repository within a “reasonable” amount of time. The validity of this hypothesis is not self-evident since no code search engine with these properties has yet been built. Although the terms “meaningful”, “large” and “reasonable” are not quantified precisely and are thus open to interpretation, they can nevertheless be concretely measured against the capabilities of existing code search engines.

Hypothesis 2

A dependency-aware code search engine of the kind referred to in Hypothesis 1, which allows users to express the dependency relationships they desire between code elements when defining queries, can enhance the precision of search results.

The premise behind this hypothesis is that if (a) users can precisely express what software structures they are looking for, containing the precise kinds of dependencies they need, and (b) a search engine can effectively deliver results from a large repository that stores these kinds of dependencies, the number of unsuitable (i.e. undesired) software structures returned in search result sets will be reduced. Technically, this corresponds to an increase in the precision of the search engine measured against the “true” requirements of the user. This increase in precision is only expected for dependency-containing queries. There is no claim that the precision of normal (i.e. dependency-free) queries is increased, although this might be a side-effect of the graph-based implementation approach used to support dependency-aware searches.

Hypothesis 3

A dependency-aware code search engine of the kind referred to in Hypothesis 1 can enhance the (local) recall of search results.

This is the “flip-side” of hypothesis 2, but related to recall rather than precision. The premise behind this hypothesis is that if (a) users can precisely express what software structures they are looking for, containing the precise kinds of dependencies they need, and (b) a search engine can effectively deliver results from a large repository that contains these kinds of dependencies, the number of suitable (i.e. desired) software structures included in the result set is increased. Technically, this corresponds to an increase in the recall of the search engine measured against the “true” set of suitable components in the repository. Again, this increase in recall is only expected for dependency-containing queries, and there is no claim that the recall of normal (i.e. dependency-free) queries is increased, although this might be a side-effect of the graph-based implementation approach used to support dependency-aware searches.


Hypothesis 4

A dependency-aware code search engine of the kind referred to in Hypothesis 1, populated by a context-independent harvesting approach, can enhance the (global) recall of search results.

Hypothesis 3 refers to an expected increase in the recall of dependency-aware searches measured in terms of the components that are contained in the search engine’s repository. Thus, the “true” set of suitable components against which recall is judged is the set of query-matching components contained in the repository. In this thesis we refer to this form of recall as “local” recall. However, since we are focussed on “open” code search engines which harvest software from “open” source software repositories and forums publicly accessible anywhere on the Internet, the “virtual” repository against which users conceptually issue their search queries is the complete set of components that are in principle harvestable from the open Internet. In other words, for users of online, open search engines, the “true” set of suitable components against which recall is judged is, conceptually at least, the set of query-matching components in the openly harvestable Internet. We refer to this form of recall as “global” recall. The key difference between local recall and global recall is that global recall takes into account the effectiveness of the component harvesting/indexing technology as well as the component retrieval technology, while local recall only takes the latter into account. Since we believe the context-independent harvesting approach will be able to find and process more components, we believe it will contribute towards an increase in the global recall of the search engine.
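The difference between the two forms of recall can be illustrated with a small worked example. The following Python sketch uses purely hypothetical component identifiers (they are not data from the evaluation in this thesis) and assumes that the set of relevant components on the Internet is known, which in practice it is not.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant components that were actually retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical identifiers, for illustration only.
relevant_on_internet = {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"}
harvested_repository = {"c1", "c2", "c3", "c4", "x1", "x2"}
retrieved = {"c1", "c2", "c3", "x1"}

# Local recall is judged against the relevant components in the repository,
# global recall against all relevant components that could be harvested.
relevant_in_repository = relevant_on_internet & harvested_repository
local_recall = recall(retrieved, relevant_in_repository)
global_recall = recall(retrieved, relevant_on_internet)

print(local_recall, global_recall)
```

In this example the retrieval step finds three of the four relevant components in the repository, but the harvester has only collected half of the relevant components that exist, so the global recall is half the local recall.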

1.4 Thesis organization

The remainder of this thesis is organised in the following way. Chapters two, three and four provide the detailed background needed to understand the problem domain tackled by the thesis and the nature of the proposed solution. Here, chapter two provides a general overview of information retrieval concepts and technologies, while chapter three describes the specific problems that are encountered when trying to apply them to software. Chapter four provides a detailed overview of the most influential code search engines that have been developed to date, both in industry and academia.

With the background established, the next three chapters present the new approaches and technologies proposed in the thesis and describe the prototype code search engine, DAISI, developed to demonstrate and test them. Chapter five begins by describing the metamodels that were developed to describe the underlying data structure used to support the new technology. These metamodels exist at various levels of abstraction so that, on the one hand, they are general enough to be applied to all kinds of programming languages and software environments and, on the other hand, are concrete enough to support efficient application to a tangible example, in this case the Java programming language. Chapter six continues by presenting the context-independent harvesting aspect of the new technology. This includes a detailed explanation of how online software component repositories are crawled and analysed, and how the harvested software is stored in a Neo4J graph database alongside multiple Lucene NLP (Natural Language Processing) indexes. Chapter seven continues the detailed exposition of the developed technology by describing the dependency-aware search aspect of the approach. This includes a description of the new dependency-aware query language, DAQL, as well as an explanation of the various new kinds of searches and features available to users.

The final two chapters round off the thesis by discussing the effectiveness and potential impact of the new search engine technology developed in the thesis. Chapter eight presents a concrete evaluation of the developed prototype search engine using increasingly complex search scenarios based on the running example used in the thesis, and demonstrates that all the hypotheses outlined in the previous section are valid, at least for this example. Finally, chapter nine concludes by summarising the main contributions of the thesis and discussing the potential impact and possible future enhancements of the developed technology.

2. General Information Retrieval Concepts

The need for information consists of the process of perceiving a difference between an ideal state of knowledge and the actual state of knowledge

– Lidwien van de Wijngaert –

The ability to build new software applications by assembling reusable components, rather than developing code from scratch, has been the dream of software engineers for decades. However, the industry is still a long way from realizing this vision in mainstream software engineering. The vision was first framed in 1968 by McIlroy, but serious research on the topic was kick-started by Krueger in 1992 with his definition “Software reuse is the process of creating software systems from existing software rather than building software systems from scratch” [Kru92]. Since then many researchers have worked on trying to turn this vision into practical reality. Although good progress has been made, significant parts of the reuse process still present serious obstacles. One of the biggest challenges is the general problem of finding suitable components to reuse in the first place. As recently as 2001, Sommerville stated that the search for software components is one of the most critical elements of an effective reuse program [Som01]. Moreover, Singer et al. in 1997 [Sin+97] and Murphy et al. in 2006 [MKF06] characterized the search for code as one of the most critical stages in the software reuse process.

The search for software components differs in some important respects from searches for other kinds of artefacts. This is because searches for software components are essentially searches for “behaviour” or “functionality”, whereas searches for most other kinds of artefacts (physical or informational) are essentially searches for “properties”. Since general search engine technology is ultimately driven by property matching, it is much easier to find suitable artefacts based on properties than on behaviour. In the case of software components, however, there are so many ways in which functionality can be mapped to code that it is very difficult to judge what concrete properties a search engine should be looking for when trying to identify suitable components. The many options for implementing a given piece of functionality arise not only because of the availability of many different programming languages, but also because there are usually numerous architectures and algorithms that can be used to solve a particular problem. Moreover, the strings used to name the classes and methods within a program are entirely at the discretion of the developer. Precisely these variations complicate the search for software components and have challenged researchers for many years.

Key questions for researchers include the structure and form of a query language for software components and the structure of the index or the database that stores the repository of components to be searched. Effective answers to these questions must take into account the many idiosyncrasies of software implementation technologies and the many different ways software engineers can map a given set of requirements into corresponding implementations. To date, most research has focused on handling the naming variations that can exist in a domain, to ensure that identifiers used in component implementations somehow match those used to formulate the search query.

Initial attempts to tackle these problems were rather simple, such as the Japanese “factory approach” for software development [Cus89], which various Japanese companies introduced to try to support a reuse program [Mat84]. These mechanisms and approaches have been continuously improved over time, however, so that today naming differences and implementation variations can be handled reasonably well. Information retrieval techniques therefore still form the backbone of software search engines. This chapter provides an overview of these techniques and evaluates their relevance for software search engines today.

2.1 Recall and Precision

Information retrieval (IR), a term introduced in the late 1940s by Calvin Mooers [Moo50], has its origins in library science. The era of computer-supported retrieval, which can be traced back to the late 1940s [Cle91] [Lid05], arose from the need to archive the rapidly growing number of newspaper articles and scientific papers caused by the emergence of computing technologies. However, the first widely-accepted definition of information retrieval was made nearly 20 years later by Gerard Salton [Sal68]:

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.”

Given the limited hardware and programming capabilities at that time and the original focus on physical books, scientific articles, newspapers and later e-mails, the initial types of searches supported were very simple. For many years it was only possible to specify the author, the title or some specific key word characterizing the desired article. However, with the invention of the World Wide Web (WWW), these early technologies were no longer sufficient and a new generation of IR technologies was required. Among other things this new generation had to cope with significantly more information. For example, in the 1960s the data stored in a new IR system was only about 1.5 megabytes, whereas today billions of online information artefacts are available which must be analysed and made searchable.

Nevertheless, the basic concepts used to measure the effectiveness of search engines are still the same. One of the most fundamental is “relevance” [CMS10], which expresses how well the items returned by an information retrieval system match the requirements or goals of the users. Basically, a document can be seen as relevant if it contains the information or the properties the user of the search engine is looking for.

Relevance can actually be split into two categories. One is topical relevance and the other is user relevance. The difference between these two relevance categories lies in the information the retrieved documents contain. In the case of topical relevance, the search results must fit only the topic of the search query, whereas in the case of user relevance additional conditions and criteria must be satisfied as well. For example, one of the results returned by a search for the German word “Fussballweltmeister” (soccer world champion) on one of the biggest existing search engines is “Italy”. From one point of view this may be correct and relevant since Italy has won the World Cup four times in the past. However, Italy is not the current world champion and is thus not a relevant result for a user searching for the team that most recently won the competition. Exactly these kinds of factors contribute in subtle ways to the user relevance of search results [CMS10].

These different forms of relevance are important in the area of software search engines. For example, the search from the previous chapter for components which use a class called Customer could also deliver results which only have the word Customer in their name. From a topical relevance point of view, this is a relevant result, but in terms of the behaviour required by the user it is completely irrelevant. Relevance by itself is therefore not a sufficiently accurate measure. The quality of search results strongly depends on the expectations of the user. In 1961, Cyril Cleverdon therefore introduced the metrics “precision” and “recall” to better quantify the effectiveness of IR systems and to allow their ranking algorithms to be validated and compared [Cle61].

Precision: the fraction of the documents retrieved from the search base that are relevant to the user’s information need.

Recall: the fraction of the documents in the search base that are relevant to the query that are successfully retrieved.
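Both definitions reduce to simple set arithmetic once the retrieved set and the relevant set are known. The following Python sketch illustrates this with hypothetical document identifiers; the function name `precision_recall` is our own and the numbers are not measurements.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set and a set of relevant documents."""
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d2", "d4", "d6", "d7"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the five retrieved documents are relevant, and two of the four relevant documents were retrieved. Note that the recall computation needs the complete set of relevant documents, which is exactly the quantity that is hard to establish for large search bases.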

Both metrics are still widely used today to evaluate new IR approaches. However, calculating the second accurately requires knowledge of the total number of relevant documents in the search base, which is very difficult to establish in practice, especially given the size of search bases today (e.g. billions of documents from the global Internet) and the rapid rate at which they change [GP99]. To address this problem, there are several models for describing the conceptual space a search engine operates in. Among the oldest IR models are the Boolean Retrieval Model and the Vector Space Model [B+99]. These models, which are both still used today, are the foundation of many subsequently developed approaches. As shown in figure 2.1, at the core of every IR model is a quadruple [D, Q, F, R(q_i, d_j)], where D is a logical view of the documents in the collection, Q is a logical view of the user queries, F is the framework for representing the documents and queries, and R(q_i, d_j) is the ranking function.

Figure 2.1: IR Model (document logical view D containing a document d_j, query logical view Q containing a query q_j, and the ranking function R(d_j, q_j) relating the two)

2.2 Boolean Retrieval

The Boolean Retrieval Model, sometimes also known as exact-match retrieval, was used in some of the first IR systems. As its name suggests, the basic idea is to include only exact matches in the result list. In other words, results are only included in the result list if they contain all of the keywords specified in the search query and satisfy all the Boolean constraints it implies. The term “implies” is used here because a query can be a complex expression whose parts are connected by logical operators. In the beginning only the basic and well-known logical operators AND, OR and NOT were supported, so a query could take the form q = q_a ∧ (q_b ∨ q_c) ∧ ¬q_d, where each q_i can itself be a sub-expression of the query q.
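A Boolean query of this form can be evaluated directly with set operations on an inverted index. The following Python sketch is a minimal illustration under our own assumptions: the four tiny documents, the whitespace tokenisation and the concrete query terms are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mini-collection, for illustration only.
documents = {
    "d1": "customer account billing service",
    "d2": "customer order invoice",
    "d3": "order shipping tracking",
    "d4": "customer order test mock",
}

# Inverted index: term -> set of documents containing the term.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

all_docs = set(documents)

# q = customer AND (order OR invoice) AND NOT test
result = (
    index["customer"]
    & (index["order"] | index["invoice"])
    & (all_docs - index["test"])
)
print(sorted(result))
```

The result is an unordered set: every document either satisfies the expression or it does not, which is why the model offers no basis for ranking the matches.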

While it is relatively easy for a user to formulate such a request, the Boolean Retrieval Model has some weaknesses. Since it assumes that relevance is also binary and all “true” results included in the result set are basically equivalent to one another, the approach provides no inherent ranking mechanism. Apart from some ordering strategies (such as sorting by date), the user can therefore often receive quite a lot of irrelevant results, even though the overall precision is high. It can thus be quite a complex task to formulate a search query to get results with a high degree of user relevance. The core problem here is that the Boolean Retrieval Model does not consider how often or in what kind of context the key words are present in the document. This weakness still persists in some of the more recent enhancements to this model, such as the introduction of the wild-card character or mechanisms to search with regular expressions. Other extensions, however, have integrated a ranking mechanism, mostly by combining the Boolean Retrieval Model with the Vector Space Model [SFW83], presented in the next section.

Figure 2.2: Boolean Retrieval Model Query conjunctive components

2.3 Vector Space Model

The Vector Space Model was one of the earliest IR approaches (alongside the Boolean model) and became one of the main focuses of IR research in the 1960s and 1970s [B+99] because it offers some key advantages – namely, ranking, term weighting and relevance feedback. It achieves this by regarding all documents and queries as vectors in a t-dimensional vector space, where t is the number of index terms within a document, like words or sentences. The vector itself is composed of the weightings of the index terms. The vectors of all documents are then combined into a matrix which constitutes the index, where each row stands for a document and the columns hold the weightings of the different terms within that document.

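$$
\begin{array}{c|cccc}
 & \text{Term}_1 & \text{Term}_2 & \cdots & \text{Term}_t \\
\hline
\text{Doc}_1 & d_{11} & d_{12} & \cdots & d_{1t} \\
\text{Doc}_2 & d_{21} & d_{22} & \cdots & d_{2t} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\text{Doc}_n & d_{n1} & d_{n2} & \cdots & d_{nt}
\end{array}
$$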

Search queries in the Vector Space Model are represented in a similar way, as vectors of the form Q = (q₁, q₂, . . . , q_t), where q_j again represents a weighting, in this context the weighting of the j-th term within the query. Since a query vector is always as long as a row in the index matrix, the results can be determined by calculating the distance between the query vector and each document vector in the matrix, which serves as a measure of their resemblance. Over the years, cosine correlation has proved to be the most effective resemblance and distance measure.
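The ranking step can be sketched in a few lines of Python. The index matrix, the three index terms and the weightings below are hypothetical values chosen for illustration, and the cosine measure is written out by hand rather than taken from a library.

```python
import math

# Hypothetical index with t = 3 index terms: customer, order, invoice.
index_matrix = {
    "doc1": [2.0, 0.0, 1.0],
    "doc2": [1.0, 1.0, 0.0],
    "doc3": [0.0, 3.0, 0.0],
}
query = [1.0, 0.0, 1.0]  # weightings of the same three terms in the query


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Rank all documents by their resemblance to the query, best match first.
ranking = sorted(index_matrix,
                 key=lambda doc: cosine(query, index_matrix[doc]),
                 reverse=True)
print(ranking)
```

Unlike the Boolean model, every document receives a graded score, so a document that shares no term with the query simply ends up at the bottom of the list.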

The biggest disadvantage of the Vector Space Model is the number of dimensions t. The approach simply does not scale up to the billions of documents that are harvestable over the Internet today because the dimension t becomes too large to manage and the time taken to calculate distances becomes too long. This problem was recognized in early uses of this approach in terms of the sizes of vectors for certain books. Vector normalization therefore became a common practice to ensure all vectors have a uniform length.

A big advantage of the Vector Space Model over the Boolean Retrieval Model is that the distance calculation yields a non-binary measure of a component’s relevance which can be used as the basis for ranking. However, this also means that every time the search query is changed the distance measures have to be re-calculated [CMS10]. Nevertheless, this inherent ranking capability laid the foundations for new approaches addressing the relevance problem and improving ranking still further.

2.4 Set-based Model

Like many other IR models, the Set-based Model introduced in 2002 has its origins in the Vector Space Model [Pôs+02]. One of the main improvements introduced in this model is to combine rules from association theory [AIS93] with the vectorial ranking mechanism. This allows it to take inter-dependencies between different index terms into account and to include documents in the result list which do not actually contain any of the keywords specified in the search query. Instead of storing the terms in documents directly in an index, this is achieved by referencing “termsets”, where each termset contains all correlated terms that express the same meaning in different words. This approach was the first mechanism for relating different documents to one another, where the relationship between two documents is determined by the degree of similarity of the referenced termsets. Instead of general termsets, the Set-based Model makes use of so-called closed-termsets. By reducing the computational complexity and the amount of data to be stored without losing information, closed-termsets have two important advantages over maximal termsets or frequent termsets [Pôs+02].

“A closed termset, cs_i, is a frequent termset that is the largest termset among the termsets that are subsets of cs_i and occur in the same set of documents. That is, given a set D ⊆ D of documents and the set S_D ⊆ S of termsets that occur in all documents from D and only in these, a closed termset cs_i satisfies the property that ∄ s_j ∈ S_D | cs_i ⊂ s_j.” [Pôs+02].
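The definition can be made concrete with a small brute-force computation. The following Python sketch is only an illustration of the definition, not the algorithm used in [Pôs+02]: it enumerates every termset over a hypothetical four-document collection, groups the termsets by the exact set of documents they occur in, and keeps the largest termset of each group. The names `support`, `closed_termsets` and `min_freq` are our own.

```python
from itertools import combinations

# Hypothetical collection: document -> set of index terms.
documents = {
    "d1": {"a", "b", "c"},
    "d2": {"a", "b", "c"},
    "d3": {"a", "b"},
    "d4": {"d"},
}


def support(termset):
    """The set of documents that contain every term of the termset."""
    return frozenset(d for d, terms in documents.items() if termset <= terms)


def closed_termsets(min_freq=1):
    """Map each occurring document set to its (unique) largest termset."""
    vocabulary = sorted(set().union(*documents.values()))
    closed = {}
    for size in range(1, len(vocabulary) + 1):
        for combo in combinations(vocabulary, size):
            termset = frozenset(combo)
            docs = support(termset)
            if len(docs) < min_freq:
                continue  # not frequent enough (or occurs nowhere)
            # Among termsets occurring in exactly the same documents,
            # only the largest one is closed.
            if docs not in closed or len(termset) > len(closed[docs]):
                closed[docs] = termset
    return closed


for docs, termset in closed_termsets().items():
    print(sorted(docs), sorted(termset))
```

In this example the eight termsets that occur in at least one document collapse to three closed termsets, which illustrates how the representation shrinks without losing the information about which terms co-occur in which documents. The enumeration is exponential in the vocabulary size and is therefore only suitable for toy examples.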

These closed-termsets are used to describe each document, and every termset is associated with a weighting pair. The first element in the pair identifies the importance of the termset within the document itself and the second the importance of the termset within the index. These weighting pairs are used for ranking, but in addition the ranking mechanism uses three other factors. The first, of course, is the frequency of occurrence of a term in a document, the second is the generality of terms, with common words that occur in many documents being down-weighted, and the third is a normalization procedure which ensures that large documents containing a lot of different terms are not unduly ranked higher than other documents.
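The interplay of these three factors can be illustrated with a conventional tf-idf weighting followed by length normalization. This is a generic sketch under our own assumptions and not the exact weighting scheme of the Set-based Model; the three documents are hypothetical and the function name `weights` is ours.

```python
import math
from collections import Counter

# Hypothetical tokenised documents, for illustration only.
documents = {
    "d1": "customer order customer invoice".split(),
    "d2": "order order shipping".split(),
    "d3": "customer order account billing service report".split(),
}


def weights(doc_id):
    """Normalized term weights for one document."""
    counts = Counter(documents[doc_id])
    n_docs = len(documents)
    raw = {}
    for term, tf in counts.items():
        # Factor 1: frequency of the term in the document (tf).
        # Factor 2: generality, terms found in many documents are down-weighted.
        df = sum(1 for words in documents.values() if term in words)
        raw[term] = tf * math.log(n_docs / df)
    # Factor 3: normalization, so long documents gain no unfair advantage.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: (w / norm if norm else 0.0) for term, w in raw.items()}


print(weights("d1"))
```

In this collection the term "order" occurs in every document and therefore receives a weight of zero, while the rarer term "invoice" outweighs "customer" in d1 even though "customer" occurs twice.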

2.5 Graph based IR Models

Although the standard approach in IR is to describe documents as a collection of words in individual index documents, there are other ways of modelling the contents of a document. One is to represent the contained text as a graph in which the nodes are words, whole sentences or even whole documents, and the edges represent the relationships between them. These relationships can be determined in different ways depending on the use case, using statistical [BHQ03], syntactical [FCC07], semantic [MMD02], orthographic [Cho+07] or linguistic relevance. The high degree of freedom in the structure of a graph makes it possible to describe the non-linear and non-hierarchical structural formalism of natural language in a mathematical way. This, in turn, provides an excellent basis for different kinds of analyses of topological, statistical and grammatical aspects of a language [BL12b]. To support these possibilities, the underlying hypothesis of the graph-based IR models is that in a coherent text fragment, related words tend to build a network of connections which approximately matches the model a human being constructs in the process of discourse and understanding [HH76].

This kind of IR model is not new, since IR approaches based on graphs were being explored as early as 1969 by Minsky in the field of semantic IR [Min69]. Many approaches based on his results were developed in subsequent years, such as neural networks, ontologies or associative networks. For example, one of the first neural networks, the Hopfield net, was used to model information in a graph in 1988 [Hop88]. In this network, information was stored in a single layer in the form of inter-connected neurons (the nodes) and their weighted synapses (edges).

A big advantage of these “connectionist” networks is that they fit very well with the Vector Space Model and the probabilistic IR models [BL12b] and greatly assist the ranking process. For example, in the context of the World Wide Web, the PageRank ranking algorithm uses a mechanism which stores the web pages as nodes and their relations as edges to determine which web page is referenced by the largest number of other web pages [Pag+98]. A web page with a lot of incoming relations is ranked higher than the others. Moreover, the information about the other web pages referencing a web page can be used in a higher-order scoring algorithm which also takes into account the scores assigned to the other linked web pages.

Thus, the score S(v_i) assigned to a web page v_i can be influenced by the score S(v_j) of every directly and indirectly related web page v_j, where |Out(v_j)| is the number of web pages referenced by v_j and δ is a so-called damping factor, which causes the influence of a page to decrease as its distance to the node in question increases.

$$
S(v_i) = (1 - \delta) + \delta \sum_{j \in V(v_i)} \frac{S(v_j)}{|\mathrm{Out}(v_j)|} \qquad (0 \le \delta \le 1)
$$
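The scoring formula can be evaluated iteratively until the scores stabilize. The following Python sketch makes several assumptions that are not stated in the text: the sum is taken over the pages that link to v_i, every score starts at 1.0, a fixed number of 50 iterations is used instead of a convergence test, and the damping factor is set to the commonly used value of 0.85. The four-page link graph is hypothetical.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively score pages; `links` maps a page to the pages it references."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    scores = {page: 1.0 for page in pages}

    # Invert the link structure: for every page, which pages reference it?
    incoming = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    for _ in range(iterations):
        new_scores = {}
        for page in pages:
            # Each referencing page passes on its score, split over its out-links.
            total = sum(scores[src] / len(links[src]) for src in incoming[page])
            new_scores[page] = (1 - damping) + damping * total
        scores = new_scores
    return scores


# Hypothetical link graph, for illustration only.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = page_rank(links)
print(max(scores, key=scores.get))
```

Page C is referenced by three other pages and ends up with the highest score, while page D, which nobody references, keeps only the base score of 1 − δ.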

The use of this ranking approach has increased significantly in recent years, especially in social networks and recommendation systems [Sch+08], because graph structures are an ideal way to identify patterns [WH91] [Sin+09] in such information stores.

Figure 2.3: Undirected and directed graph [BL12b] (two graphs, G1 and G2, each over the nodes N1 to N7)

One of the weaknesses of the graph-based approach is its scalability, since performance can decrease dramatically as the size of the graph grows. Depending on the structure of the graph, calculating scores or traversing the paths within the graph can be very costly. This happens, for example, if there are nodes in the graph which are referenced by a lot of other nodes and are therefore visited very often in the analysis or searching process. In the case of graph-based representations of programs, this would be the case for primitive types, which, if modelled as individual nodes, would be referenced by nearly every class. However, if these structural characteristics are recognized and such bottleneck nodes are avoided, the performance of graph-based IR approaches does not differ significantly from other approaches [BL12b].

3. Information Retrieval for Software Components

No man understands a deep book until he has seen and lived at least part of its contents.

– Ezra Pound –

Since source code is text, software components can be viewed as nothing more than sequences of strings just like books or web pages. Traditional information retrieval (IR) technologies can therefore be used to support the storage and retrieval of software components [SM83] [FB92]. However, although many of the techniques developed by IR researchers are helpful, such as correcting spelling errors or identifying synonyms, traditional IR technology based on the Boolean Retrieval Model or the Vector Space Model leaves a lot to be desired when used for code. For example, there is a mismatch between the properties of natural language and the technical structure and vocabulary of formal programming languages. This primarily creates the problem of how to formulate requests to such a system so that the user receives the relevant documents contained in the database and is not overwhelmed by irrelevant results, because traditional free text approaches do not “understand” the special meaning of the concepts in source code. This chapter discusses the problems involved in using traditional information retrieval approaches for supporting the retrieval of software components and describes the range of possible solutions. It starts by enumerating the different retrieval approaches devised by researchers and then explains how these affect the problem of judging the relevance of software components. Finally, this chapter discusses a range of realization choices.

3.1 Software Retrieval Methods

Several promising approaches for searching for components using natural language have emerged over the years, such as the profiling approach of Maarek et al. from 1991 [MBK91]. In this approach, a certain number of indicators are extracted and combined to create a profile of a component. This profile can be used as the basis for matching search queries to components to find relevant results. However, Maarek had to contend not only with the conceptual limitations of the IR methods available at that time, but also with technical implementation details. In particular, full-text indexes as we know them today did not exist at that time, so Maarek had to build a custom index based on an uncontrolled vocabulary and clustering techniques. As pointed out by Mili et al. in 1998, software component search technologies were far from satisfactory in the past, and there was a lot of scope for further research [MMM98]. The problems with component retrieval technologies at that time were not just technical. Another well-known problem addressed by numerous researchers was the so-called “vocabulary problem” [Fur+87]:

“No single word can be chosen to describe a programming concept in the best way”.

As a consequence, mapping natural language concepts used by software engineers when developing solutions to the technical constructs of programming languages is a challenging task, since components can be developed in so many different ways [Mar+04]. Nevertheless, there are several approaches for mapping natural language to software components. For example, the natural language comments found in classes or methods can be used to infer mapping information, and some approaches like Exemplar [Gre+10b] leverage the accompanying documentation to support natural language searches for components, based on the text they contain. Unfortunately, however, the documentation accompanying components is often of poor quality and has its own semantic gap to the implementation [BMW94]. As well as searching for components by natural language, therefore, other researchers investigated approaches that build a special index based on their structural characteristics [ZW95]. This approach has a lot of potential because, as shown by graph-based IR models, software components have a lot of dependencies which could be captured in an index. However, this line of research has been neglected for a long time and has only recently been revived. Instead, the thrust of current research into software component search (or recommendation) has been to adapt and extend traditional IR methods.

In their seminal 1998 paper, Mili et al. subdivided the various retrieval methods into six different categories [MMM98].

1. Information retrieval methods
2. Descriptive methods
3. Operational semantic methods
4. Denotational semantic methods
5. Structural methods
6. Topological methods

However, Hummel et al. regard the sixth category, topological methods, as ranking mechanisms [HJA07] rather than search methods, so we will not consider this category further here.

Descriptive methods

Descriptive methods are related to methods that describe components in a textual way. However, in contrast to traditional information retrieval methods, which analyse the complete document (i.e. all the code), they only deal with abstract descriptions of components that refer to a few keywords. The main objective of descriptive methods, therefore, is to classify components rather than to understand their semantics. When a search request is made, the keywords in the query are checked against a list of keywords that describes the component. The vocabulary problem causes a difficulty here, since the same concepts and classifications can be described in many different ways and with different terms [Mit98]. As Mili et al. suggested, the best classifications are usually performed by human beings, such as software engineers or system administrators, but this does not scale to the millions of components today’s software search engines deal with.

Operational semantics methods

As the name suggests, these methods are related to the operational semantics of the executable code and use a unique feature of software called its “functionality”. Podgurski and Pierce mentioned as early as 1993 that software components can be identified by only a few input and output parameters. Using such characterizations of components it would, in their opinion, be theoretically possible to search for a component using only a few keywords and some input and output parameters. This would require the execution of components in the repository to determine whether they implement the required input/output mappings [PP93]. However, there are several significant technical problems in doing this on a large number of potential third party components. One is the sheer number of logistic issues involved in getting heterogeneous, third party components harvested from the Internet to compile and run in an automated way, given the many dependencies on specific operating systems, frameworks and other packages that need to be resolved before software can run. Another problem is related to security. In an open-source repository where software components are harvested by automatic crawlers from the Internet, no mechanism currently exists to identify in an automated way what functionality a component actually provides. Therefore, when executing components on a server as part of a functionality-based search, it is quite possible that malicious software could be executed. Nevertheless, solutions to some of these problems have been found, and execution-oriented search engines have a number of advantages, as demonstrated by Hummel’s test-driven search technology [HA04]. Innovations built into this technology included the use of traditional IR searches to pre-filter and rank components prior to testing and the use of sandbox mechanisms to protect the system against malicious code.
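The basic idea of matching components by their input/output behaviour can be sketched as follows. This Python fragment is a deliberately simplified illustration and not the implementation of [PP93] or [HA04]: the candidate components are hypothetical in-process functions, and they are executed directly without the sandboxing that a real search engine would need.

```python
def matches(candidate, examples):
    """True if the candidate reproduces every (arguments, expected result) pair."""
    for args, expected in examples:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A candidate that fails to run on the inputs is not a match.
            return False
    return True


# Hypothetical candidates, e.g. pre-filtered by a conventional keyword search.
candidates = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "concat": lambda a, b: str(a) + str(b),
}

# The query: input/output samples describing the desired functionality.
examples = [((2, 3), 6), ((4, 5), 20)]

results = [name for name, function in candidates.items()
           if matches(function, examples)]
print(results)
```

Only the candidate whose observable behaviour agrees with all samples is returned, regardless of how it is named. The sketch also makes the security concern tangible: every candidate is executed, so untrusted code must be isolated before such a check can be run on harvested components.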

Denotational semantic methods

In contrast to the operational semantics methods, which are based on executing components, denotational approaches aim to discern the semantics of components by analysis. These methods attempt to establish a match between the search query and software components using an additional document providing more information about the component than just the information in the source code. These methods are mainly related to signature matching search algorithms.
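Signature matching can be illustrated without executing any candidate. The following Python sketch assumes that the candidates are annotated Python functions and that a query consists of the parameter types and the return type; both the candidate functions and the helper `signature_of` are hypothetical and only serve to show exact signature matching.

```python
import inspect


# Hypothetical candidate components with type annotations.
def total_price(quantity: int, unit_price: float) -> float:
    return quantity * unit_price


def repeat(text: str, times: int) -> str:
    return text * times


def scale(factor: int, value: float) -> float:
    return factor * value


def signature_of(function):
    """Return (parameter types, return type) as declared by the function."""
    sig = inspect.signature(function)
    parameter_types = tuple(p.annotation for p in sig.parameters.values())
    return parameter_types, sig.return_annotation


# Query: a component taking (int, float) and returning float.
query = ((int, float), float)

found = [f.__name__ for f in (total_price, repeat, scale)
         if signature_of(f) == query]
print(found)
```

Both `total_price` and `scale` match the query, which shows the typical weakness of pure signature matching: components with identical signatures can implement quite different functionality, so the approach is usually combined with name-based or execution-based filtering.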

Structural methods

The main difference between the structural methods and the other methods is how the software component is viewed. Instead of looking at the functional properties of the code, the focus is placed on the structural characteristics of the component such as the kinds of patterns used within it. As Mili mentioned, this is the most suitable approach when the goal is not to directly re-use components per se, but rather to take them as the basis for further development or as reference examples [MMM98]. In the case of direct re-use with copy & paste, it is more effective to match components to queries based on functionality rather than structure. However, as Mili points out, components with similar functionality often also have similar structure. Although this does not always apply in practice, there is certainly a strong correlation between the two [MMM98]. Research on this topic has primarily focused on the area of clone detection where approaches are being developed to infer the functionality of components from their structure to determine clones [QLS13].

Most code search engines developed to date fall into one or more of the six categories of component retrieval methods defined by Mili. With sometimes only small adaptations to standard IR technology, the first generation of code search engines was able to deliver quite reasonable results [FP94] [MBK91]. However, as Mili pointed out, they also leave a lot to be desired. They not only have difficulty coping with the syntax of programming languages, but also with their structural characteristics, which have changed significantly over time. For example, the structural changes to programs introduced by object-oriented languages created major challenges for structural searches. With the advent of these languages it became possible to implement a function not only in a single class, but also to distribute the implementation over several classes. For an information retrieval system that is designed to compare a search request only with a single document rather than a collection of documents, this is a big problem.

Code search engines like Assieme [HFW07], Sourcerer [Baj+06], Portfolio [McM+11] or Codifier [Beg07] have tried to develop various “tricks” to support “structure-aware” searches that are able to take the distributed nature of modern software into account, but these have had only limited success. Other search engines, such as Exemplar or LISA, which analysed Twitter messages [AG15], have attempted to use the documentation accompanying components to infer information about their content or quality.

Still other tools have explored the use of graph-based IR methods. For example, GraPacc, a Graph-based, Pattern-oriented, Context-sensitive tool for Code Completion, is able to give developers code suggestions directly in their IDEs based on the code they are currently working on and a graph-based representation of the already existing method calls [Ngu+12]. The creators of GraPacc claimed they were able to increase the precision of searches to 95% and the recall to 92%. However, this requires developers to have a basic knowledge of the framework being used and to have already written some of the implementation. Another approach in this area is ParseWeb, which supports queries of the form “source→destination” [TX07] and also takes sequences of method calls into account. In this approach, code examples are analysed for method calls based on their abstract syntax tree (AST). However, these two tools only consider the method calls and sequences within one class. The first approach to take the actual processes behind the method calls and the classes they involve into account was Portfolio. However, this required a fundamental change to the nature of search queries [McM+11] to take into account the fact that processes can have different paths. These paths are influenced, for example, by the different ways a class can be initialized or the parameters of methods.
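To illustrate what a query of the form “source→destination” asks for, the following Python sketch searches for a shortest chain of calls that turns an object of one type into an object of another. It is only an illustration under stated assumptions: the hand-written `METHODS` table stands in for the call information that a real tool would extract from code examples, and the breadth-first search and the function name `find_call_sequence` are choices made here, not a description of how ParseWeb works internally.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical API table: type -> list of (call label, resulting type).
# A real tool would derive this information from analysed code examples.
METHODS: Dict[str, List[Tuple[str, str]]] = {
    "File": [("new FileReader(file)", "FileReader"), ("file.toPath()", "Path")],
    "FileReader": [("new BufferedReader(reader)", "BufferedReader")],
    "Path": [("Files.newBufferedReader(path)", "BufferedReader")],
    "BufferedReader": [("reader.readLine()", "String")],
}


def find_call_sequence(source: str, destination: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest call chain from source to destination."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        current, calls = queue.popleft()
        if current == destination:
            return calls
        for label, result in METHODS.get(current, []):
            if result not in visited:
                visited.add(result)
                queue.append((result, calls + [label]))
    return None  # no chain of known calls connects the two types
```

A query such as `find_call_sequence("File", "BufferedReader")` returns a list of call labels in the order in which they would have to be applied, or `None` if the table contains no connecting path.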

Although this discussion has only covered the most well-known search approaches, it already shows the diversity of approaches that researchers have explored and explains the strengths and weaknesses of the various approaches that have been developed to support component retrieval. It also reinforces Mili et al.’s contention that the quality of component retrieval approaches can be improved in basically one of two ways – by improving the structure of the databases used to store the components and by

3.2 Search Queries

The way in which users have to formulate search queries has a big effect on the usability and effectiveness of code search engines. User understanding in particular is a frequently underestimated factor, because virtually all search engines today support the classical keyword-based approach as a fall-back option if users do not specifically define another type of search. However, they usually also provide an “advanced” query language in which it is possible to add more context about the keywords. For example, Koders offers the prefixes “cdef:” and “mdef:”, which can be used to identify classes or methods with a specific name. If these prefixes are not used in a search query, Koders is unaware of the “meaning” of the following keywords and just searches for them “blindly” like any other keyword.
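How such prefixes add context to a query can be sketched with a small Python parser. Only the two prefixes named in the text are taken from the source; the field names `class_name`, `method_name` and `keywords`, and the function `parse_query`, are assumptions for illustration and say nothing about how Koders processes queries internally.

```python
from typing import Dict, List

# The prefixes come from the text; the field names on the right are assumed.
PREFIXES = {"cdef": "class_name", "mdef": "method_name"}


def parse_query(query: str) -> Dict[str, List[str]]:
    """Split a query into prefixed terms and plain fall-back keywords."""
    parsed: Dict[str, List[str]] = {"class_name": [], "method_name": [], "keywords": []}
    for token in query.split():
        prefix, sep, value = token.partition(":")
        if sep and prefix in PREFIXES and value:
            # A known prefix tells the engine what kind of name is meant.
            parsed[PREFIXES[prefix]].append(value)
        else:
            # Without a known prefix the token is searched "blindly".
            parsed["keywords"].append(token)
    return parsed
```

For the query `cdef:Stack mdef:push linked list`, the parser separates the class name and the method name from the two remaining keywords, whereas the query `Stack push` ends up entirely in the keyword list.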

A search engine query language must provide the features needed to express the full context of the keywords appearing in a query. However, it must do so in a way that is as simple and intuitive as possible for users. Powerful search algorithms and database structures are not going to deliver significant benefits if the associated query language is extremely complex and difficult to use [CMS10]. The ease with which search queries can be formulated and understood has been shown to have a major impact on the perceived usability of code search engines [CMS10]. In general, however, the more information that can be conveyed in a query the more effective the corresponding searches are likely to be.

Haiduc et al. developed a tool named Refoqus, which is able to predict the quality of the search results likely to be generated by a query and makes suggestions for reformulating it [Hai+13]. For example, when searching for code it is important to know what information relates to a class and what information relates to methods within it. Such a query reformulation approach can recommend how to change the query so that it will deliver the most relevant results. Another approach for improving query languages is the Program Query Language (PQL) developed by Martin et al. [MLL05]. PQL allows users to formulate queries using structural information such as the order of method calls to an object. The combination of textual and structural information in search queries was introduced several years later by Wang, who developed a search query lang
