3.3 Mining of Android repositories from GitHub
3.3.1 Search for Android projects
The first step in building the context of tested open-source Android projects was the extraction of all Android projects hosted on GitHub. To that end, the GitHub Repository Search API⁸ was used. This API returns all the repositories that contain a given keyword in their names, readme files or descriptions. Its language parameter filters the projects by the programming language they are written in, and its created parameter restricts the results to the repositories created in the interval between two given dates. Since the GitHub API returns at most 1000 repositories for each search, the created parameter was used to cycle with different queries over a set of disjoint date ranges, in order to obtain fewer than 1000 results for each of them.
⁷ http://eyeautomate.com/eyeautomate.html
⁸ https://developer.github.com/v3/search/
The GitHub API was queried from a bash script through the cURL command-line tool, and the results (which the API provides in JSON format) were parsed automatically using the jsawk tool⁹. The data mining process was performed between September and December 2016.
The resulting search string, using the GitHub search API, is the following one:
curl -x , -u $USER:$PASSWORD -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/repositories?q=android+language:java+created:"'$CURR_DATE_RANGE'"&sort=stars&order=desc&page='$CURRENT_PAGE''
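The cycle over disjoint date ranges can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes one range per calendar month, and the helper names (days_in_month, month_ranges, search_url) are hypothetical. It only prints the query URLs, leaving the authenticated cURL call and the check that each range stays under 1000 results to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build one "created" date range per calendar month
# and the matching Repository Search URL. No network access is performed.

# Number of days in a month, accounting for leap years.
days_in_month() {
  local year=$1 month=$2
  case $month in
    4|6|9|11) echo 30 ;;
    2)
      if (( (year % 4 == 0 && year % 100 != 0) || year % 400 == 0 )); then
        echo 29
      else
        echo 28
      fi ;;
    *) echo 31 ;;
  esac
}

# Print disjoint ranges such as 2016-02-01..2016-02-29, one per line.
month_ranges() {
  local first_year=$1 last_year=$2 year month
  for (( year = first_year; year <= last_year; year++ )); do
    for (( month = 1; month <= 12; month++ )); do
      printf '%04d-%02d-01..%04d-%02d-%02d\n' \
        "$year" "$month" "$year" "$month" "$(days_in_month "$year" "$month")"
    done
  done
}

# Build the search URL for one date range and one result page.
search_url() {
  local range=$1 page=$2
  printf 'https://api.github.com/search/repositories?q=android+language:java+created:%s&sort=stars&order=desc&page=%s\n' \
    "$range" "$page"
}

# Usage: print the first-page query for every month of 2016.
month_ranges 2016 2016 | while read -r range; do
  search_url "$range" 1
done
```

If a monthly range still returned 1000 results or more, it would have to be split further (for instance into weeks) before paging through it.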
A second filtering step was then applied to the Android repositories, in order to cut out from the context all the repositories that did not have a release history. This was done because the final aim of the studies on the context of Android projects was to track the evolution of test code, and the occurrences of fragilities inside it. Hence, projects without at least one tagged release in addition to the master were removed from the context, because they did not allow even a single comparison between two consecutive releases. The Git Tags API¹⁰, which outputs the names of all the tagged releases of a given GitHub repository, was used for this purpose: all the projects that returned a single tagged release were excluded from the selected set.
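The release-history filter can be illustrated with a short sketch. It assumes that the list of tags of a repository has already been downloaded as a JSON array in which each tag object carries exactly one "name" key (the shape returned by the repository tags listing at https://api.github.com/repos/OWNER/REPO/tags). The function names are hypothetical, and counting keys with grep is a deliberately crude stand-in for a real JSON parser such as jsawk.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: decide whether a repository has a release history
# from the JSON list of its tags, read from standard input.

# Count the "name" keys in the JSON document on stdin.
# Assumption: every tag object contains exactly one "name" key.
count_tags() {
  grep -o '"name"[[:space:]]*:' | wc -l | tr -d '[:space:]'
}

# Succeeds only when more than one tagged release was returned,
# mirroring the exclusion of projects with a single tagged release.
has_release_history() {
  local tag_count=${1:-0}
  (( tag_count > 1 ))
}

# Usage with an inline sample instead of a live API response.
sample_tags='[{"name":"v1.1","commit":{"sha":"def"}},{"name":"v1.0","commit":{"sha":"abc"}}]'
tag_count=$(printf '%s\n' "$sample_tags" | count_tags)
if has_release_history "$tag_count"; then
  echo "keep"
else
  echo "discard"
fi
```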
A set of repositories obtained by searching the word “Android” may include actual Android applications, but also spurious results, e.g. libraries, utilities and applications for other systems that interact with Android devices. Hence, a heuristic was needed to filter out those spurious projects automatically and as accurately as possible. The method used by Das et al. [36] was adopted: a given repository was considered an actual Android application if it contained one (or more) Android Manifest files. A Manifest file¹¹ is a mandatory file (with that exact name) for any Android app, since it contains essential metadata for the building, installation and use of the application (e.g., all the components of the app, the required permissions and hardware features are specified in it). Hence, repositories that do not contain any Manifest file were cut out from the context. Repositories with multiple Manifest files (which may suggest either the presence of multiple builds for a single app, or the inclusion in the same repository of multiple apps) were evaluated as single projects for the subsequent investigations, since the metrics described in the following sections apply to whole repositories and not necessarily to single apps.
⁹ https://github.com/micha/jsawk
¹⁰ https://developer.github.com/v3/git/tags/
To cut out the projects without Android Manifest files, the GitHub Code Search API¹² was used. The API offers the filename parameter, which restricts a keyword search to files with a given filename.
The GitHub Code Search API has some limitations, which however were not considered very relevant in the context of our study. As explained in its documentation, (i) only the default branch (the master branch in most cases) is considered for code search; (ii) only files smaller than 384 KB are searchable; (iii) only repositories with fewer than 500,000 files are searchable. The second and third issues were not considered a concern for searching Android applications, since such projects are typically not particularly large.
For the purposes of this study, the GitHub Code Search API has been parameterized using the keyword manifest, and AndroidManifest.xml as the required filename, as in the following:
curl -x , -u $USER:$PASSWORD -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/code?q=manifest+filename:AndroidManifest.xml+repo:ligi/passandroid' | jsawk 'return this.items' | jsawk 'return this.path'
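The Manifest-based heuristic can be sketched as a pair of small helpers. This is an illustration under stated assumptions rather than the original script: the function names are hypothetical, the response is assumed to expose the number of matches in a top-level "total_count" field, and sed is used in place of jsawk so that the example has no external dependencies.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build the Code Search query for one repository and
# decide from the number of matches whether it counts as an Android app.

# Build the query URL for a repository given as owner/name.
manifest_search_url() {
  local repo=$1
  printf 'https://api.github.com/search/code?q=manifest+filename:AndroidManifest.xml+repo:%s\n' "$repo"
}

# Extract the first "total_count" value from the JSON response on stdin.
total_count() {
  sed -n 's/.*"total_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# A repository is kept when at least one Manifest file was found.
is_android_app() {
  local manifest_hits=${1:-0}
  (( manifest_hits >= 1 ))
}

# Usage with an inline sample instead of a live API response.
sample_response='{"total_count": 2, "incomplete_results": false, "items": []}'
hits=$(printf '%s\n' "$sample_response" | total_count)
if is_android_app "$hits"; then
  echo "keep"
else
  echo "discard"
fi
```

Repositories with several matches are still kept as a single project, consistently with the choice of evaluating whole repositories rather than single apps.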
Fig. 3.1 Search procedure for Android projects and test classes associated with the considered testing tools
To limit the context to applications that were provided with an actual GUI, a second heuristic was adopted, cutting out all the projects that did not feature at least one call to the setContentView method, or one declaration of a FragmentTransaction object. These two elements were selected because setContentView is typically the first method called in the onCreate() method of any Activity, and has the role of populating the screen with the widgets described in a layout resource. The FragmentTransaction, on the other hand, is used to handle the creation of a Fragment to populate the activities instead of using the static function to load a layout. Such filtering was performed, again, by leveraging the GitHub Code Search API, passing the names of the described method and object as search keywords.
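The GUI heuristic reduces to two keyword searches per repository and a logical OR over their results. The sketch below assumes that each keyword is combined with a repo qualifier in the Code Search query, which the text does not state explicitly. The function names are hypothetical, and the match counts are taken as already extracted from the API responses.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the GUI heuristic: one Code Search query per
# keyword, and a check that at least one of the two keywords was found.

# Print the two query URLs for a repository given as owner/name.
gui_search_urls() {
  local repo=$1 keyword
  for keyword in setContentView FragmentTransaction; do
    printf 'https://api.github.com/search/code?q=%s+repo:%s\n' "$keyword" "$repo"
  done
}

# Succeeds when either keyword occurs at least once in the repository.
has_gui() {
  local set_content_view_hits=${1:-0} fragment_transaction_hits=${2:-0}
  (( set_content_view_hits > 0 || fragment_transaction_hits > 0 ))
}

# Usage: list the queries, then classify a repository with 0 and 3 matches.
gui_search_urls ligi/passandroid
if has_gui 0 3; then
  echo "keep"
else
  echo "discard"
fi
```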