To provide a ready-to-use framework for Usability-based Split Testing, we have designed a novel context-aware tool called WaPPU, i.e., “Was that Page Pleasant to Use?”. The tool covers the whole process from interaction tracking to deriving correlations and learning usability models. Based on the principles of Usability-based Split Testing, we have implemented WaPPU as a central split testing service, which has been realized using node.js¹¹. Split testing projects are created in the WaPPU dashboard (Fig. 5.5), which provides the developer with ready-to-use JavaScript snippets that simply have to be included in the different interfaces-under-test. The only other requirement for deploying the split test is a client-side jQuery plug-in for component-based interaction tracking. The overall architecture of WaPPU can be seen in Figure 5.5. The current implementation supports at most two interfaces per split test, i.e., only A/B testing is supported. An example configuration is given in Listing 5.2.
Listing 5.2: Exemplary WaPPU configuration.
    WaPPU.start({
      projectId: 42,
      interfaceVersion: 'A',
¹¹ http://nodejs.org/ (Feb. 21, 2014).
68 Chapter 5 Usability-based Split Testing
This JavaScript snippet defines an interface-under-test associated with the split testing project with ID 42 and specifies that an INUIT questionnaire is shown to users of the interface. The number of hovers and the hover time are tracked for the component defined by the jQuery selector #nav. Moreover, the length of the cursor trail as well as the cursor speed are tracked for the component #content.
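The behavior described above can be sketched as one complete call. Only `projectId` and `interfaceVersion` are taken from Listing 5.2; the option name `provideQuestionnaire`, the second argument mapping jQuery selectors to feature labels, and the stub object standing in for the plug-in are assumptions made for illustration and may differ from the actual WaPPU API.

```javascript
// Stand-in for the client-side plug-in so that the sketch runs on its own.
// It only records its arguments; the real WaPPU.start is provided by WaPPU.
const WaPPU = {
  start(options, trackedComponents) {
    return { options, trackedComponents };
  },
};

const call = WaPPU.start(
  {
    projectId: 42,              // from Listing 5.2
    interfaceVersion: 'A',      // from Listing 5.2
    provideQuestionnaire: true, // ASSUMED option name: show the INUIT questionnaire
  },
  {
    // ASSUMED shape: jQuery selector -> feature labels as listed in Tab. 5.2
    '#nav': ['hovers', 'hoverTime'],
    '#content': ['cursorTrail', 'cursorSpeed'],
  }
);
```

The second interface-under-test would use the same project ID with a different `interfaceVersion`.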
In the following, we describe the version of WaPPU that features usability scores based on a 3-point scale¹². Without loss of generality, the latest implementation of WaPPU¹³ has been adapted to feature a 2-point scale for more reliable predictions in production settings, even with smaller amounts of data. In general, choosing the size of the scale is a trade-off between the precision of the usability measurements (cf. Section 4.9) and the reliability of predictions.
5.6.1 Interaction Tracking
Tab. 5.2.: Complete list of interaction features supported by WaPPU (* = whole page feature only).
label | description | source
arrivalTime | time elapsed from page load till arrival at component | Speicher et al. (2013a)
charsDeleted | # deleted characters |
charsTyped | # characters typed |
clicks | # clicks | Speicher et al. (2013a)
cursorMoveTime | time the mouse cursor spends moving | Speicher et al. (2013a)
cursorRangeX* | cursor range on X axis | Q. Guo and Agichtein (2012)
cursorRangeY* | cursor range on Y axis | Q. Guo and Agichtein (2012)
cursorSpeed | cursorTrail divided by cursorMoveTime | Q. Guo and Agichtein (2012) and Speicher et al. (2013a)
cursorSpeedX | cursor speed in X direction | Q. Guo and Agichtein (2012)
cursorSpeedY | cursor speed in Y direction | Q. Guo and Agichtein (2012)
cursorStops | # cursor stops | Q. Guo and Agichtein (2012)
inputFocusAmount | # focus events on input elements |
cursorTrail | length of cursor trail | Q. Guo and Agichtein (2012) and Speicher et al. (2013a)
cursorTrailX | length of cursor trail on X axis | Q. Guo and Agichtein (2012)
cursorTrailY | length of cursor trail on Y axis | Q. Guo and Agichtein (2012)
hovers | # hovers | Speicher et al. (2013a)
hoversPrevHovered | # hovers over previously hovered text elements | Navalpakkam and Churchill (2012)
multiplyHoveredText | # multiply hovered text elements | Navalpakkam and Churchill (2012)
hoverTime | total time spent hovering the component |
scrollDirChanges* | # changes in scrolling direction | Nebeling et al. (2013c)
scrollMaxY* | maximum scrolling distance from top | Q. Guo and Agichtein (2012)
scrollPixelAmount* | total amount of scrolling (in pixels) | Q. Guo and Agichtein (2012)
scrollSpeed* | scrolling speed | Q. Guo and Agichtein (2012)
textSelections | # text selections |
textSelectionLength | total length of all text selections |
¹² https://github.com/maxspeicher/wappu-service/tree/alpha (Feb. 28, 2015).
¹³ https://github.com/maxspeicher/wappu-service (Feb. 28, 2015).
5.6 WaPPU: Was that Page Pleasant to Use? 69
WaPPU adds another 12 features to the 15 already given in Table 5.1. That is, our tool considers 27 well-defined user interaction features, which are summarized in Table 5.2.
They have been derived from existing research (Q. Guo and Agichtein, 2012; Navalpakkam and Churchill, 2012; Nebeling et al., 2013c; Speicher et al., 2013a) as well as additional considerations. charsTyped, charsDeleted and inputFocusAmount have been added because we hypothesize that on a SERP, interaction with the search box correlates with success in finding the desired content. textSelections and textSelectionLength have been added because from our own experience with usability testing we know that some users select the piece of text they are currently reading; hence, we hypothesize that text selections correlate with readability.
The features are tracked for each component defined by the developer, except for features annotated with an asterisk, which cannot be applied to individual components. Moreover, each feature is tracked for the whole web interface (≙ component “total”), which gives an additional implicit component.¹⁴ This gives us the chance to derive the relative distribution of features across the page, e.g., “25% of the total cursor trail lies in the navigation component”.
That is, for each component each feature not marked with an asterisk in Table 5.2 has an absolute and a relative representation. Hence, if a developer defines x components in their web interface and specifies that all features shall be considered, WaPPU ultimately tracks a total of 20(2x + 1) + 7 features during the split test (7 features are applied to the whole page instead of components).
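The counting argument can be checked with a short sketch. The constants come from the text (27 features, 7 of them whole-page only, hence 20 per component); the function names `trackedFeatureCount` and `relativeShare` are hypothetical and not part of WaPPU.

```javascript
const COMPONENT_FEATURES = 20; // features without an asterisk in Tab. 5.2
const PAGE_ONLY_FEATURES = 7;  // features applied to the whole page only

// Number of tracked features for x developer-defined components:
// every component feature has an absolute and a relative value per component (2x),
// plus one value for the implicit component "total" (+1).
function trackedFeatureCount(x) {
  if (!Number.isInteger(x) || x < 0) {
    throw new RangeError('x must be a non-negative integer');
  }
  return COMPONENT_FEATURES * (2 * x + 1) + PAGE_ONLY_FEATURES;
}

// Relative representation of a feature: the component's share of the "total" value.
function relativeShare(componentValue, totalValue) {
  return totalValue === 0 ? 0 : componentValue / totalValue;
}
```

For the two components of Listing 5.2 (#nav and #content), `trackedFeatureCount(2)` gives 20 · 5 + 7 = 107 features, and `relativeShare(250, 1000)` gives 0.25, i.e., the "25% of the total cursor trail" from the example above.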
¹⁴ For cumulative features, such as hoverTime, the value for the component “total” is the aggregation across all components defined by the developer. For features that are a maximum, such as maxHoverTime, the value for the component “total” is the global maximum across all components defined by the developer. For examples, see Appendix E.1.3.
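The two aggregation rules for the component “total” can be illustrated as follows. This is a minimal sketch: the function name, the `kind` argument and the sample values are assumptions, and the aggregation of cumulative features is assumed to be a plain sum.

```javascript
// Derive the value of the implicit component "total" from per-component values.
// kind: 'cumulative' (e.g., hoverTime) or 'maximum' (e.g., maxHoverTime).
function totalForFeature(kind, valuesPerComponent) {
  const values = Object.values(valuesPerComponent);
  if (kind === 'cumulative') {
    // ASSUMPTION: "aggregation" means the sum across all components
    return values.reduce((sum, v) => sum + v, 0);
  }
  if (kind === 'maximum') {
    return values.length > 0 ? Math.max(...values) : 0;
  }
  throw new Error(`unknown feature kind: ${kind}`);
}

// Hypothetical hover times in milliseconds for two components
const hoverTimes = { '#nav': 1200, '#content': 3000 };
const totalHoverTime = totalForFeature('cumulative', hoverTimes);
const totalMaxHoverTime = totalForFeature('maximum', hoverTimes);
```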
5.6.2 Usability Judgments
WaPPU offers the option to automatically show a questionnaire when users leave an interface-under-test, in case they have agreed to contribute training data. This questionnaire contains the seven usability items of INUIT, each formulated as a question and to be answered on a 3-point scale (bad–neutral–good). Since the value of an item is thus either −1, 0 or +1, we get an overall usability value that lies between −7 and +7. These values are what we refer to as usability scores. The seven judgments are then sent to the server side together with the tracked user interactions. The questionnaire can be shown on none, one, or all of the interfaces-under-test in a split test. If no interface features a questionnaire, the functionality of WaPPU is reduced to collecting interactions only, i.e., for use with usability heuristics (cf. Sec. 5.5.2).
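As a small illustration of how a usability score follows from one filled-in questionnaire, the sketch below sums seven item values. The function name and the input format (an array of seven numbers) are assumptions, not WaPPU's data format.

```javascript
// Overall usability score of one questionnaire:
// the sum of seven item values, each -1 (bad), 0 (neutral) or +1 (good).
function usabilityScore(answers) {
  if (answers.length !== 7) {
    throw new RangeError('expected exactly seven item values');
  }
  for (const a of answers) {
    if (![-1, 0, 1].includes(a)) {
      throw new RangeError(`invalid item value: ${a}`);
    }
  }
  return answers.reduce((sum, a) => sum + a, 0);
}

const best = usabilityScore([1, 1, 1, 1, 1, 1, 1]);
const mixed = usabilityScore([-1, 0, 1, 1, 0, -1, 1]);
```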
If it is featured on one interface, WaPPU automatically learns seven models in near real-time (one per usability item) based on the users’ answers. These models are associated with the corresponding split testing project and stored in WaPPU’s central repository (Fig. 5.5). They are automatically applied to the remaining interfaces for model-based usability prediction (cf. Sec. 5.5.3). The current implementation of WaPPU provides the option to use the updateable version of the Naïve Bayes classifier¹⁵ or the Hoeffding tree classifier¹⁶, both provided by the WEKA API (Hall, Frank, et al., 2009).
Finally, in case all interfaces feature the questionnaire, the developer receives the most precise data possible. This case requires no models and is particularly useful for remote asynchronous user studies from which one can derive heuristic rules for usability evaluation (cf. Sec. 5.5.2). It is not intended for evaluation of online interfaces since the number of questionnaires shown to real users should be minimized.
5.6.3 Context-Awareness
The context of a user is automatically determined by WaPPU, and all collected interactions and usability judgments are annotated accordingly. In this way, it is possible to integrate context into a usability model, since different contexts trigger different user behaviors. Currently, we consider two aspects that to a high degree influence a user’s interactions: ad blockers and screen size. That is, the context determined by our tool is a tuple (adBlock, screenSize) with adBlock ∈ {true, false} and screenSize ∈ {small, standard, HD, fullHD}. For this, we refer to the most common screen sizes and define: small < 1024 × 768 ≤ standard < 1360 × 768 ≤ HD < 1920 × 1080 ≤ fullHD.¹⁷ Moreover, developers are provided with an option to include their own contextual attributes upon initializing WaPPU in a web interface.
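The context tuple can be sketched as follows. The text does not state how two-dimensional resolutions are compared, so this sketch assumes that a screen belongs to a class if both its width and its height reach the class's lower bound; the function names and the input object are likewise hypothetical.

```javascript
// Lower bounds of the screen size classes, largest first (values from the text).
const SCREEN_CLASSES = [
  { name: 'fullHD', minWidth: 1920, minHeight: 1080 },
  { name: 'HD', minWidth: 1360, minHeight: 768 },
  { name: 'standard', minWidth: 1024, minHeight: 768 },
];

// ASSUMPTION: a class applies if width AND height reach its lower bound.
function screenSizeClass(width, height) {
  for (const c of SCREEN_CLASSES) {
    if (width >= c.minWidth && height >= c.minHeight) return c.name;
  }
  return 'small';
}

// Build the context tuple (adBlock, screenSize) used to annotate collected data.
function detectContext({ width, height, adBlock }) {
  return { adBlock: Boolean(adBlock), screenSize: screenSizeClass(width, height) };
}

const context = detectContext({ width: 1366, height: 768, adBlock: true });
```

In a browser, the width and height could be read from `window.screen`; how WaPPU itself obtains them and detects ad blockers is not described in this section.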
Small-screen and touch devices are not supported in the current version of WaPPU. They are detected using the MobileESP library¹⁸ and corresponding data are ignored.
¹⁵ http://weka.sourceforge.net/doc.dev/weka/classifiers/bayes/NaiveBayesUpdateable.html (Jan. 14, 2015).
¹⁶ http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/HoeffdingTree.html (Jan. 14, 2015).
¹⁷ Cf. http://en.wikipedia.org/wiki/Display_resolution (Feb. 12, 2014).
¹⁸ http://blog.mobileesp.com/ (Feb. 12, 2014).
Fig. 5.6.: Screenshot of the WaPPU dashboard showing the evaluation of the A/B test carried out during the case study presented in Section 8.2. It shows the scores (mean and standard deviation) of each usability factor and the overall usability metric for the two involved SERPs (Interface “A” = original SERP, Interface “B” = redesigned SERP, * significant difference).
5.6.4 The WaPPU Dashboard
WaPPU summarizes the metrics and scores of an A/B test in a dedicated dashboard (Figure 5.6). For each of the seven individual factors, a score between −1 and +1 is given, corresponding to the bad–neutral–good scale described above. The score of the overall usability metric is then determined by summing up all factors, i.e., it has a value between −7 and +7, which is normalized to a value between 0% and 100%. All scores are visualized together with their standard deviations. The overall score is provided in analogy to SUS (Brooke, 1996).
Bangor et al. (2008) found that the university grade analog (100–90 points correspond to an “A”, 89–80 points to a “B” etc.) is a good rule-of-thumb for interpreting SUS scores. Moreover, they state that “products which are at least passable have [...] scores above 70” (Bangor et al., 2008). Thus, we set 70% as the lower bound for a good usability score in WaPPU. This corresponds to individual factor scores of 0.4, i.e., if all usability factors have a score of 0.4, we get an overall score of 70%.
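The relation between factor scores, the overall score and the 70% bound can be made explicit with a short sketch. It assumes a linear mapping of the range [−7, +7] onto [0%, 100%], which is the mapping consistent with the statement that factor scores of 0.4 yield 70%; the function names are hypothetical.

```javascript
const GOOD_USABILITY_THRESHOLD = 70; // lower bound for a good score, in percent

// Map an overall score in [-7, +7] linearly onto [0, 100] percent.
function normalizeOverall(sum) {
  return ((sum + 7) / 14) * 100;
}

// Overall usability in percent from the seven factor scores (each in [-1, +1]).
function overallPercent(factorScores) {
  const sum = factorScores.reduce((s, f) => s + f, 0);
  return normalizeOverall(sum);
}

function isGoodUsability(percent) {
  return percent >= GOOD_USABILITY_THRESHOLD;
}

// Seven factors at 0.4 each: (7 * 0.4 + 7) / 14 = 9.8 / 14 = 0.7, i.e., 70%
// (up to floating-point rounding).
const atBound = overallPercent(new Array(7).fill(0.4));
```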
If an interface-under-test features the questionnaire, the eight usability scores displayed in the dashboard are derived directly from users’ answers. Otherwise, WaPPU predicts the scores based on the interactions collected on the respective interface and the available models. For example, if interface “A” displays the questionnaire, WaPPU learns the seven models $M_i$ based on the corresponding answers and the interactions $\vec{I}^{A}$ tracked on “A”: $M_i \leftarrow \mathrm{learn}(\mathit{answer}_i^{A}, \vec{I}^{A}) \;\; \forall i \in \text{usability factors}$. The usability scores of interface “B” are then inferred from these models and the interactions collected on “B”: $\mathit{score}_i^{B} = M_i(\vec{I}^{B}) \;\; \forall i \in \text{usability factors}$.
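The learn-on-“A”, predict-on-“B” scheme can be sketched with a small incremental classifier. WaPPU itself relies on WEKA's updateable Naïve Bayes or Hoeffding tree classifiers; the Gaussian Naïve Bayes below is a simplified stand-in written for illustration, not WEKA's implementation, and all names as well as the session format are assumptions.

```javascript
// Minimal updateable Gaussian Naive Bayes (Welford's online mean/variance).
class UpdateableGaussianNB {
  constructor(numFeatures) {
    this.numFeatures = numFeatures;
    this.stats = new Map(); // label -> { n, mean[], m2[] }
    this.total = 0;
  }

  // Incorporate one training instance without revisiting earlier ones.
  update(features, label) {
    if (!this.stats.has(label)) {
      this.stats.set(label, {
        n: 0,
        mean: new Array(this.numFeatures).fill(0),
        m2: new Array(this.numFeatures).fill(0),
      });
    }
    const s = this.stats.get(label);
    s.n += 1;
    this.total += 1;
    features.forEach((x, j) => {
      const delta = x - s.mean[j];
      s.mean[j] += delta / s.n;
      s.m2[j] += delta * (x - s.mean[j]);
    });
  }

  // Return the label with the highest log posterior (null if untrained).
  predict(features) {
    let best = null;
    let bestLogP = -Infinity;
    for (const [label, s] of this.stats) {
      let logP = Math.log(s.n / this.total);
      features.forEach((x, j) => {
        // Small constant keeps the variance positive for tiny samples.
        const variance = (s.n > 1 ? s.m2[j] / (s.n - 1) : 0) + 1e-6;
        logP += -0.5 * Math.log(2 * Math.PI * variance)
          - ((x - s.mean[j]) ** 2) / (2 * variance);
      });
      if (logP > bestLogP) {
        bestLogP = logP;
        best = label;
      }
    }
    return best;
  }
}

// M_i <- learn(answer_i^A, I^A): one model per usability item, trained on "A".
function learnModels(sessionsA, numItems = 7) {
  const numFeatures = sessionsA[0].interactions.length;
  const models = Array.from({ length: numItems },
    () => new UpdateableGaussianNB(numFeatures));
  for (const { interactions, answers } of sessionsA) {
    models.forEach((model, i) => model.update(interactions, answers[i]));
  }
  return models;
}

// score_i^B = M_i(I^B): predict the seven item scores for one session on "B".
function predictScores(models, interactionsB) {
  return models.map((model) => model.predict(interactionsB));
}
```

Each session on “A” is assumed to be an object `{ interactions, answers }` with a numeric feature vector and seven item values; summing the predicted item values gives the predicted overall usability score for a session on “B”.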
Additionally, the WaPPU dashboard features a traffic light indicating statistically significant differences between the interfaces-under-test. That is, WaPPU applies a Mann–Whitney U test to the (predicted) overall usability scores produced by all involved users. In case the usability scores of the two interfaces-under-test do not differ significantly (i.e., no definite statement about which one is better can be made), the traffic light is yellow; otherwise it is red (“A” better) or green (“B” better).
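The traffic light logic can be sketched as follows. The sketch uses the normal approximation of the Mann–Whitney U test with tie correction and without continuity correction; the significance level of 0.05, the use of the normal approximation, and all function names are assumptions, since the text does not specify how WaPPU computes the test.

```javascript
// Average ranks (1-based) of the pooled values, plus the sizes of tie groups.
function rankWithTies(values) {
  const indexed = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks = new Array(values.length);
  const tieSizes = [];
  let k = 0;
  while (k < indexed.length) {
    let end = k;
    while (end + 1 < indexed.length && indexed[end + 1].v === indexed[k].v) end++;
    const avgRank = (k + end) / 2 + 1;
    for (let m = k; m <= end; m++) ranks[indexed[m].i] = avgRank;
    tieSizes.push(end - k + 1);
    k = end + 1;
  }
  return { ranks, tieSizes };
}

// Standard normal CDF via the Abramowitz-Stegun approximation of erf (7.1.26).
function normalCdf(z) {
  const t = 1 / (1 + (0.3275911 * Math.abs(z)) / Math.SQRT2);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
    + t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(z * z) / 2);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided Mann-Whitney U test (normal approximation, tie-corrected variance).
function mannWhitneyU(a, b) {
  const n1 = a.length;
  const n2 = b.length;
  if (n1 === 0 || n2 === 0) throw new RangeError('both samples must be non-empty');
  const { ranks, tieSizes } = rankWithTies([...a, ...b]);
  const rankSumA = ranks.slice(0, n1).reduce((s, r) => s + r, 0);
  const uA = rankSumA - (n1 * (n1 + 1)) / 2; // large uA: values in a tend to be larger
  const n = n1 + n2;
  const tieTerm = tieSizes.reduce((s, t) => s + (t ** 3 - t), 0);
  const sigma = Math.sqrt(((n1 * n2) / 12) * (n + 1 - tieTerm / (n * (n - 1))));
  const z = sigma === 0 ? 0 : (uA - (n1 * n2) / 2) / sigma;
  const p = 2 * (1 - normalCdf(Math.abs(z)));
  return { uA, z, p };
}

// yellow: no significant difference; red: "A" better; green: "B" better.
function trafficLight(scoresA, scoresB, alpha = 0.05) { // alpha is an ASSUMED level
  const { z, p } = mannWhitneyU(scoresA, scoresB);
  if (p >= alpha) return 'yellow';
  return z > 0 ? 'red' : 'green';
}
```

For small samples, an exact test would be more appropriate than the normal approximation used here.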