Validating Evaluation Themes - The playthrough evaluation frameworkreliable usability evaluat

to Nielsen’s original study. Following this,Chapter 5(Exploring Evaluation Resource Speciﬁcity) qualitatively reﬂects on the evaluation process with evaluator interviews and content analysis of the heuristics in order to reveal the reasons for the evaluator disagreements.

4.4 Validating Evaluation Themes

Section 4.3 (Heuristic Evaluation) showed that individual evaluators assigned widely diﬀerent ratings to the same heuristics, exhibiting lowinter-rater reliabilitydue to diﬀerences in subjec- tive interpretation. Despite failing to be validated reliably, many of the heuristics are similar to one another and seem to address common themes.

This section reconsiders the data for further analysis, following the same approach used by Nielsen in his original study (Nielsen, 1994a). The research question explored is whether there is any cohesion in the ratings that would validate general themes for evaluation rather than speciﬁc heuristics.

4.4.1 Background

Chapter 4(Testing Heuristic Evaluation for Video Games) was designed to be similar to Nielsen’s original publication (Nielsen, 1994a) where his canonical 10 heuristics were derived. Nielsen rated many diﬀerent sets of heuristics for how well they described the usability issues reported in several earlier studies. As only a single rater was used, Nielsen did not consider reliability. Instead his method was to apply a statistical approach,principal components analysis, to ex- plore the relationships between the many diﬀerent heuristics used, and to uncover underlying similarities in the themes addressed by the heuristics.

4.4.1.1 Principal Component Analysis Shows Similarities in Variables

Principal components analysisis a formally deﬁned as a statistical method which reveals linear functions of unknown components, from a set of observed data with known variables. The analysis shows the relationships between these variables, and groups them into clusters or “components” of related areas.

The analysis shows how much of the variance in ratings is due to each of the individual components, giving an indication of the relative importance of each. In the current context, the heuristics are the variables analysed. Each variable is also described as “loading” to each component to a certain degree. This loading to a component is the correlation coeﬃcient with the heuristic, where higher absolute values indicate stronger correlation, and negative values indicate negative correlation. Typically several similar heuristics will load strongly to single component, indicating that they all address a common theme.

4.4.1.2 Principal Component Analysis Requires Interpretation

Principal components analysis produces statistics that need to be interpreted as the method does not in itself deﬁne clear-cut conclusions, only data that need to be considered.

Heuristics that have high loadings to a particular component are more strongly correlated together, indicating that evaluators rated these heuristics similarly to one another. Low component loadings on the other hand indicate that the evaluator’s ratings of the heuristics were not strongly correlated together. Every heuristic has a loading value for each component, so a decision must be made as to which heuristics are most relevant to each component. There is no universal loading threshold at which it can be said that a variable is considered to be a sig- niﬁcant contributor to a component. Instead this decision is up to the analyst’s interpretation. Similarly the number of components to include is also a matter for interpretation, and a variety of approaches can be used.

The Kaiser criterion is the default option in SPSS and most other software. This criterion only includes components that are relatively strong candidates, indicated by a mathematical property of the component, the “eigenvalue”, having a magnitude of at least 1.0 (Kaiser,1960). A common procedure advocated in Cattell (1966) is to visually analyse a scree plot of the components and the variance in ratings they are each responsible for. In this plot the components are ordered by decreasing amount of variance. Often there will be a number of components with higher values for variance, with a clear “step” where the plot decreases no- ticeably to the components that have lower variance values. This “elbow” can be used as the cut-oﬀ point at which components with lower variance values are discarded.

Alternatively, the number of components to accept can be determined by the sum of variance they collectively contribute. For example, the analyst could decide to accept the number of components that together explain 75% of variance in evaluator ratings.

A simpler approach is to decide that only a pre-determined number of components will be used, such as only accepting the top 10 components.

Regardless of any other approach used, in all cases the components should be subjectively inspected in order to determine whether they represent coherent themes. This is particularly important for components that account for smaller amounts of the total variance, and where heuristics only load weakly. In these instances it may be the case that the heuristics are not thematically related, but have been included as a component merely through the coincidental rather than meaningful co-variance in their ratings.

4.4.1.3 Nielsen’s Principal Component Analysis

Nielsen’s study looked at the relationships amongst 249 usability problems identified during heuristic evaluationof 11 different systems, using 101 heuristics from 7 different sets, rated by a single evaluator, summing to 25,149 individual evaluations. The study found that 53 components were needed to explain 90% of the variance amongst ratings of the heuristics. 25 components accounted for 1% or more of the variance, summing to 62% of the variance in total. There was a gradual decline of variances in the scree plot, so no clear cut-off point was identified. Instead the top 7 components that accounted for the most variance (30% of the total) were arbitrarily chosen to be included in the analysis. The loaded variables were examined, and components were given descriptive names to represent the common areas they addressed. The top 7 components and the amount of variance they accounted for is shown in Table 4.1 (“Nielsen’s 7 components and variances”) on the next page

Table 4.1: Nielsen’s 7 components and variances

Variance Component

6.1% Visibility of System Status.

5.9% Match between system and real world. 4.6% User control and freedom.

4.2% Consistency and standards. 3.7% Error prevention.

3.1% Recognition rather than recall. 2.8% Flexibility and Eﬃciency of use.

These components were then used to inform the derivation of his canonical set of heuristics. A similar approach was used in the study presented in this chapter. Principal components analysis was used to identify the underlying themes addressed by the heuristics rated inSection 4.3(Heuristic Evaluation). These themes were then used to derive novel evaluation resources in later chapters of this thesis.

4.4.2 Method

Principal components analysis was conducted on the heuristic rating data from Section 4.3 (Heuristic Evaluation). In that study 88 usability issues were considered within one system, employing 149 heuristics from 6 diﬀerent sets, and rated by 3 evaluators, summing to 38,544 ratings in total. Following the recommendation from Kaiser (1960), Varimax rotation was applied, as was Kaiser Normalisation. Additionally, as suggested by Stevens (2002), components were only considered if they loaded with absolute magnitudes of 0.40 or greater. The software used was SPSS Statistics 19.

Three separate analyses were produced, one for each of the three evaluators, as well as a composite of all of the evaluators’ aggregated data.

4.4.3 Results

Theprincipal components analysisshows how the evaluators’ ratings of each issue vary with respect to the other heuristics, revealing underlying themes that are common across separate sets of heuristics.

Setting Thresholds

In order to identify how many components should be included in the analysis, scree plots were produced. Appendix A.3(Scree Plots) presents the plots for each individual evaluator and the aggregate of all three evaluators’ ratings. As with Nielsen’s data, in each case the graphs show similar results with a gradual decline in the slope of the curve. There is no characteristic “elbow” or drop-oﬀ in the variance contributed by each component that could indicate a threshold for including components in the analysis.

As no clear threshold was seen in the plots it was necessary to reﬂect on the heuristics included in each component in order to determine which components represented a coherent theme. Generally the components that contributed greater amounts of the total variance in ratings included heuristics that were similar to one another. Components with lower variances tended to include heuristics that were unrelated. The analysis produced these components as the ratings of these heuristics were coincidentally similar, but not in a meaningful way.

Identifying Similar Components

Each component produced by the analysis consists of multiple heuristics that tended towards being given the same rating by the evaluator. Although the three evaluators ratings exhibited lowinter-rater reliabilityin theheuristic evaluation, the components identiﬁed by theprincipal components analysisshow similar groupings of related heuristics.

In order to make sense of the data the variables loading on each component need to be inspected and a descriptive name assigned to represent the area they address. For example, the component that generally accounted for the greatest amount of variance brings together heuristics related to skills that the player learns during the game. As such this component was given the descriptive name, “Learning Skills”.

There were noticeable similarities in the components identiﬁed for each evaluator, so in the next stage of analysis similar components were grouped together. For example, all three evaluators had components that addressed the player’s ability to learn and use skills needed to accomplish game tasks. To demonstrate these similarities, loadings for the heuristics asso- ciated to these components are shown in Appendix A.5 (Component Loadings). These tables show the heuristic loadings for the component, “Learning Skills”. Loadings are listed for each of the individual evaluators’ components that addressed this theme as well as a separate ﬁgure for the heuristic loadings on the component aggregated from all three evaluators’ ratings. All three evaluators had components that deal with the issue of learning skills, and in most cases the same heuristics appeared in each component, and often in the same order. For example, Heuristic 68: “Player given opportunity to model correct behavior and skills” contributed the second most variance of the ratings for evaluator 1, and the most variance for evaluator 3 and evaluator 2.

Drawing out Meaning in the Components

The heuristics loading on these grouped components were inspected in order to determine which brought together heuristics that were coherently related to one another, and which may have be included through random chance. 21 meaningful components were identiﬁed across the three analyses. The heuristics further inﬂuencing the remainder of the components were increasingly unrelated to one another, and the amount of variance they contributed continued to decrease. Descriptive names were assigned to represent the areas addressed by the related heuristics. Appendix A.4(Component Variance) shows the descriptive names for these components, along with the amount of variance in ratings that they each explain. Separate tables are presented for the individual evaluators’ ratings, as well as the aggregate of all three evaluators’ ratings.

The aggregate analysis found that 37 components explain 77.10 % of the total variance in the ratings. The final 21 components identified collectively account for 57.85% of the variance, each of which contributes greater than 1% to the total. The top 5 components each explaining more than 3% of the variance account for 22.49% of the total variance. These figures are in a similar range to those reported by Nielsen.

In document The playthrough evaluation framework reliable usability evaluation for video games (Page 97-101)