9.3 Visual to Layout-based GUI test scripts translator (Proof of Concept)
9.4.1 Experiment Design
To perform the evaluation, TOGGLE was applied to two test suites developed for open-source Android applications, available both on GitHub and on the Play Store: Omni-Notes v6.0.0 (footnote 2) and PassAndroid v2.5.0 (footnote 3). The applications were chosen because of the differences they exhibited in the way their GUIs were built and in the operations required on their activities to go through their principal usage scenarios.
Each of the test suites consisted of 30 independent test cases (i.e., a failure in one test case does not influence the result of test cases executed later). Test cases were built from the Espresso commands recognized by the Enhancer (i.e., all the Espresso ViewActions except for ScrollTo and PressIMEActionButton), and comprised between 4 and 18 interactions, including checks.
Each test case was translated with TOGGLE into the six destination syntaxes detailed in the previous section. All the generated visual test cases were executed ten times to evaluate their robustness. The executions were performed on a machine with an Intel i7-8550U CPU (1.80 GHz), 16 GB of RAM, and the Windows 10 operating system. The emulated AVD for the execution of the apps was a Nexus 5X with API 25 installed, with the device frame and keyboard input enabled. Executions of visual test scripts (or Java code embedding image recognition API calls) were performed on a solid black background, to minimize the possible interference of other visual elements appearing on-screen at the same time as the AUT.
2 https://github.com/federicoiosue/Omni-Notes
3 https://github.com/PassAndroid
RQ5 can be split into two sub-questions, each related to a different non-functional property measured for generated 3rd-generation test cases. First, to understand the dependability and the robustness of generated test suites, we gathered insights about the percentage of failing and passing executions of generated test cases. Hence, RQ5.1 could be formulated as:
RQ5.1 : What are the differences in reliability between the six combinations of visual test script techniques?
To answer RQ5.1, we relied on the Success Rate (SR) metric, which can be computed for each test case t as
$$SR_t = \frac{N_s}{N_{ex}}, \tag{9.1}$$
where $N_s$ is the number of executions ending with success and $N_{ex}$ is the total number of executions of test case t, fixed to 10 in the experiment for all the generated test cases.
Based on the SR metric, test cases were labeled in three different classes:
• Passing, when all 10 executions of the test case ended with success (SR = 1);
• Failing, when all 10 executions of the test case ended with failure (SR = 0);
• Flaky, when some of the 10 executions of the test case ended with success and others with failure (0 < SR < 1).
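The labeling above can be sketched in a few lines of Python. This is only an illustration of Eq. 9.1 and the three classes; the function names and the sample outcomes are hypothetical and are not taken from TOGGLE or from the experiment's data.

```python
from fractions import Fraction


def success_rate(outcomes):
    """Return SR = N_s / N_ex for one test case.

    `outcomes` holds one boolean per execution (True = success).
    A Fraction is used so that SR == 1 and SR == 0 are exact checks.
    """
    if not outcomes:
        raise ValueError("at least one execution is required")
    return Fraction(sum(outcomes), len(outcomes))


def classify(outcomes):
    """Label a test case as Passing, Failing, or Flaky from its SR."""
    sr = success_rate(outcomes)
    if sr == 1:
        return "Passing"
    if sr == 0:
        return "Failing"
    return "Flaky"


# Hypothetical outcomes of 10 executions for three test cases.
print(classify([True] * 10))               # Passing
print(classify([False] * 10))              # Failing
print(classify([True] * 7 + [False] * 3))  # Flaky
```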
It is assumed that flakiness is due to imprecisions of the image recognition algorithm, while failing test cases are considered the consequence of translation errors or intrinsic limitations of the visual testing tools. This unpredictability was expected, as several studies have reported the inherent uncertainty of the outcomes of Visual GUI test executions, especially for those produced with Sikuli [6].

Reason                                 Sleep time
Long-click                             600 ms
Swipe                                  200 ms
Multiple key press (e.g., Ctrl + M)    20 ms
Replace text                           50 ms
Post-interaction sleep                 1000 ms
EyeAutomate failure                    5000 ms
Sikuli failure                         5000 ms
Table 9.7 TOGGLE: sleep times introduced in generated test scripts
A Fisher’s Exact Test for success (pass or fail) of the test scripts vs. the tool used for the 3rd generation translation was applied, to assess the difference between the alternative tools in terms of the correctness of the execution of the generated test cases.
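As an illustration of the kind of test described here, the following sketch computes a two-sided Fisher's exact test for a 2x2 table (tool vs. pass/fail) using only the Python standard library. The restriction to two tools and the counts in the example are assumptions made for illustration; they are not the study's data, and the study may have used a larger contingency table.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are the two tools, columns are (passed, failed).
    The p-value sums the probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x passes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: tool A passes 25 of 30, tool B passes 20 of 30.
print(fisher_exact_2x2(25, 5, 20, 10))
```

For tables with more than two tools, a library routine for r x c tables would be needed; the sketch covers only the pairwise case.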
In addition to the success rate of the generated test cases, the performance of the 3rd generation testing tools was measured and compared to that of Espresso. RQ5.2 can be formulated as:
RQ5.2 : What are the differences in performance between the six combinations of the visual test scripts and the original 2nd generation test scripts?
To answer RQ5.2, the average execution time (Tx) of all the passing test executions was measured. The execution time was normalized by the number of interactions performed inside the test case, in order to make the measures for different test cases comparable. It must also be considered that, by construction of the translated test scripts, the execution time is not comparable with that of Espresso, because of the static sleep instructions that were introduced in the translated interactions and between each pair of interactions. Table 9.7 reports the added sleep instructions. Sleeps between interactions were added to mitigate synchronization issues, i.e., to avoid failures of visual test cases when the visual element to interact with is not displayed on screen immediately after the execution of the previous interaction. Those sleeps are not needed by Espresso test scripts, since the tool automatically waits for the required widgets to be loaded on-screen by the Activity code.
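The normalization described above can be sketched as follows. The tuple layout, the function names, and the sample timings are hypothetical; the sketch only shows the arithmetic of dividing each passing execution's time by its number of interactions and averaging the result.

```python
from statistics import mean


def normalized_time(total_seconds, n_interactions):
    """Execution time per interaction for a single execution."""
    if n_interactions <= 0:
        raise ValueError("a test case needs at least one interaction")
    return total_seconds / n_interactions


def mean_normalized_time(executions):
    """Average normalized time over passing executions only.

    `executions` is a list of (total_seconds, n_interactions, passed).
    Failed executions are excluded, as in the measurement of Tx.
    """
    passing = [
        normalized_time(seconds, n)
        for seconds, n, passed in executions
        if passed
    ]
    if not passing:
        raise ValueError("no passing executions to average")
    return mean(passing)


# Hypothetical measurements: two passing runs and one failed run.
runs = [(12.0, 6, True), (20.0, 10, True), (99.0, 3, False)]
print(mean_normalized_time(runs))  # 2.0 seconds per interaction
```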
An ANOVA test was applied for the execution time (normalized by the number of interactions). First, the effect of the generation (2nd vs. 3rd) was tested; then the effect of the specific 3rd generation tool combined with the app was tested.
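For the first step (effect of the generation), a one-way ANOVA F statistic can be sketched with the standard library alone. The grouping and the sample values are hypothetical, and the sketch stops at the F statistic and its degrees of freedom: turning it into a p-value requires the F distribution, for which a statistics library such as SciPy (`scipy.stats.f_oneway`) would normally be used. The second step in the text, which combines tool and app, would need a multi-factor model and is not covered here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and n > k observations")

    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Variation explained by group membership.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Residual variation inside each group.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical normalized times (s per interaction) for two generations.
second_gen = [1.0, 2.0, 3.0]
third_gen = [2.0, 3.0, 4.0]
print(one_way_anova_f([second_gen, third_gen]))
```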
9.4.2 Threats to Validity
Threats to Conclusion Validity
To check for statistically significant differences among the target tools, standard statistical tests were applied. The results are clear-cut and consistent with the visual representations, which report standard (95%) confidence intervals or complete distributions.
Threats to External Validity
The results of this evaluation cannot be generalized to arbitrary Espresso test suites. Additionally, since the objective of the evaluation is primarily to assess the precision of the generated 3rd generation test cases, it did not make sense to use generic Espresso test cases with interactions not supported by the tool.
The conclusions about the reliability and performance of 3rd generation test suites are limited to the considered tools for the evaluation, namely Sikuli and EyeAutomate. The same limited generalizability of the results also applies to the AUTs that were selected. Apps with a very different graphical appearance may induce significantly different results.
Threats to Internal Validity
The results about the performance of the generated 3rd generation test scripts are influenced by the static sleeps added during the translation of 2nd generation test scripts, which by contrast need no explicit sleep instructions. In future versions, sleeps may be made dynamic, using GUI-state information to determine that components have loaded properly before proceeding. Dynamic sleeps are expected to improve performance by reducing unnecessary waiting time between interactions.
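One possible shape for such a dynamic sleep is a polling wait with a timeout, sketched below. This is not part of TOGGLE and does not use the Sikuli or EyeAutomate APIs: the `is_ready` callback is a hypothetical stand-in for whatever GUI-state check a future version would use, and the default timeout and polling interval are arbitrary.

```python
import time


def wait_until(is_ready, timeout_s=5.0, poll_interval_s=0.1):
    """Poll `is_ready` until it returns True or `timeout_s` elapses.

    Returns True as soon as the condition holds, False on timeout,
    so the caller waits only as long as the GUI actually needs.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Hypothetical readiness check: becomes true on the third poll.
polls = {"count": 0}


def widget_is_displayed():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(widget_is_displayed, timeout_s=1.0, poll_interval_s=0.01))
```

Compared with a fixed post-interaction sleep, the wait returns as soon as the check succeeds, and the timeout bounds the delay when the element never appears.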
The evaluation of the robustness of generated test cases is based on the assumption that all the operations have been performed correctly if the final state of the application is verified. This assumption does not take into account the possibility, albeit unlikely, that multiple wrong operations on the widgets during a single test case may compensate for each other, leading the test case to succeed at the final visual check.