Inter-rater reliability between the evaluators proved to be systematically weak across all of the heuristics available in the literature. Reasons for this are summarised in Section 4.6 (Conclusion), which presents the argument that heuristic evaluation does not produce results with an appropriate degree of reliability for summative evaluation of video games. Section 4.4 (Validating Evaluation Themes) continues Nielsen’s approach by statistically examining the evaluators’ rating data using principal components analysis. Despite the low inter-evaluator reliability exhibited in the heuristic evaluation, this statistical analysis reveals underlying patterns in the evaluators’ ratings which suggest that the heuristics do address important design and evaluation themes. These themes are unpacked in subsequent chapters and used to inform the development of a novel methodology for more reliable evaluation.
4.2 User Test
Method
A single-player first-person shooter console game, Aliens Vs. Predator (Rebellion Developments, 2010), was evaluated with user testing as part of the commercial work conducted by the Vertical Slice game usability evaluation studio.
The game was a high fidelity interactive prototype evaluated shortly prior to release. Only a portion of the game was complete to a level of quality indicative of the final product, and only these sections were tested. Each session lasted approximately one hour, and the whole user test was conducted over two days in laboratory conditions.
Six participants played the game on an Xbox 360 connected to a widescreen HD television. Video cameras recorded the player, and realtime footage from the game console was simultaneously streamed to the observation room next door. All feeds were composited together on a widescreen HD display, and saved to disk for later analysis. The game’s producer, a senior user experience consultant, and colleagues monitored the participants’ play from the observation room. The user experience consultant had spent some time familiarising himself with the game before the test sessions, and the producer was able to identify when players were not playing the game as the designers had intended.
Each of the observers informally made notes on the participants’ play sessions. At the end of all the user test sessions the notes were informally aggregated into a final report.
Participants
Six male players were recruited to fit the target demographic provided by the client: four self-identified “mainstream” gamers (19, 22, 20, 20 yrs) and two “core” gamers (22, 30 yrs). The mainstream gamers owned one console and played games approximately once per week, but did not consider gaming to be a major hobby. The core gamers owned more than one console at home, played games several times per week, and self-identified as gamers.
Results
Following the user test sessions, a report was produced by three user experience professionals, including the senior consultant who ran the session. The report listed the usability and playability issues encountered by each participant, as well as some additional suggestions proposed by the senior consultant. In total 88 issues were identified (Appendix C.2.4 - Issue Analysis). While making notes and subsequently merging the individual reports it was noted that each observer had recorded issues in the test session somewhat differently from the others, occasionally attributing the problem to different aspects of the game. These differences in interpretation suggested that it would be worth exploring the analysis in more detail by applying heuristic evaluation to the issues recorded in the final report. The main focus of this chapter is on the following stage, which was to evaluate which heuristics were violated by each issue, and hence to validate or refute their applicability beyond their original studies.
4.3 Heuristic Evaluation
In order to explore the user test data in more detail a heuristic evaluation was proposed. However, the proliferation of heuristic sets seen in the literature raises the question of which to use, and how to compare one to another. The video game heuristics in the literature were considered for suitability for testing a first-person shooter game, and those intended for different platforms (e.g., mobile), domains (i.e., not games), or genres (such as RTS) were excluded from further consideration, as were a number of subjective or otherwise non-validated design guidelines (Malone, 1980, 1982).
A number of lists were rejected due to being superseded (Desurvire et al., 2004), work-in-progress, or not formally published in peer-reviewed venues (Desurvire and Chen, 2008; Desurvire and Wiberg, 2008; Schaffer, 2007). While Korhonen et al. (2009) created their heuristics based on evaluation of a mobile game, their structure is modular and the mobile components were removed to allow assessment of the core gameplay and game usability aspects. Similarly, the list proposed in Pinelle et al. (2008a) was based on video game reviews rather than empirical evidence derived from user testing. However, the large corpus of data from which the heuristics were extracted should serve as a solid basis for evaluation. Likewise the most recent PLAY list (Desurvire and Wiberg, 2009) was included in the study, despite being based on game reviews rather than formal user testing. The GAP list (Desurvire and Wiberg, 2010) is specifically intended to address the first experiences of game players, and was explicitly compared against user testing, so is an ideal candidate for consideration. Although not peer-reviewed, the list in Federoff (2002) was derived from commercial game developers, so should have some practical basis, and it has been widely cited in, and continues to influence, subsequent academic publications. Nielsen’s canonical list (Nielsen, 1994a) was included as a de facto standard for heuristic evaluation. While it was created with data from different domains (such as productivity systems on textual and telephonic platforms) it was included in order to compare the validation of traditional and game-specific heuristics.
In some cases, heuristics from earlier publications appeared verbatim in later sets. These exact duplicates were removed, leaving 146 unique entries (Appendix E - 146 Heuristics) remaining from the following six sources:
• Federoff (2002)
• PLAY Desurvire and Wiberg (2009)
• Pinelle et al. (2008a)
• GAP Desurvire and Wiberg (2010)
• Korhonen et al. (2009) (excluding mobile components)
• Nielsen (1994a)
Method
Three researchers who had conducted the user testing in the previous section examined each of the 88 issues reported, and considered them against the 146 heuristics.
Following the procedure used by Nielsen to derive his canonical heuristics (Nielsen, 1994a), each evaluator rated each issue against each heuristic using his ordinal scale from 0 to 5 to describe how well the heuristic explained the issue:
0. Does not explain the problem at all.
1. May superficially address some aspect of the problem.
2. Explains a small part of the problem, but there are major aspects of the problem that are not explained.
3. Explains a major part of the problem, but there are some aspects of the problem that are not explained.
4. Fairly complete explanation of why this is a usability problem, but there is still more to the problem than is explained by the heuristic.
5. Complete explanation of why this is a problem.
In total 38,544 ratings were made across the 3 evaluators. By comparison, Nielsen’s study consisted of 25,149 ratings from a single evaluator, so inter-rater reliability was not considered there.
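The total follows directly from the study design, since every evaluator rated every issue against every heuristic. As a minimal check of that arithmetic (the variable names are illustrative only and do not come from the study materials):

```python
# Figures taken from the text: 88 issues, 146 heuristics, 3 evaluators.
N_ISSUES = 88
N_HEURISTICS = 146
N_EVALUATORS = 3

# One "case" is one issue-heuristic pair; each case is rated once per evaluator.
cases = N_ISSUES * N_HEURISTICS
ratings = cases * N_EVALUATORS

print(cases, ratings)
```

The number of issue-heuristic pairs (12,848) and the number of individual ratings (38,544) correspond to the case and decision counts reported alongside the reliability statistic in the results.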
The three evaluators involved were a video game user experience doctoral student with professional experience of conducting user tests; a further video game user experience doctoral student (the author of this thesis), considered as a “double expert” with professional experience of conducting user experience tests and professional game development; and a further human-computer interaction researcher. The evaluators participated in a training session where the heuristics were reviewed, and uncertainty about the meaning or intention of particular heuristics was discussed and consensus reached.
The 88 issues from the user test session were randomly ordered and presented to the three independent evaluators for inspection. Every issue from the user test session was presented with the 146 heuristics, randomised in a unique way for each evaluator and each of the heuristics. This counterbalancing prevented order effects where repeated evaluation of heuristics in the same order could have influenced the evaluators’ decision making process. Once each of the evaluators had completed their evaluation of all of the issues, the data were collected together for statistical analysis, presented in the following section.
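The randomised presentation described above could be generated along the following lines. This is a minimal sketch and not the tool used in the study: the function name, the evaluator labels, and the fixed seed are hypothetical, and it assumes that the issue order was shuffled separately for each evaluator, which the text does not state explicitly.

```python
import random


def presentation_orders(n_issues, n_heuristics, evaluators, seed=0):
    """Return, for each evaluator, a shuffled list of (issue, heuristic_order) pairs.

    Issues and heuristics are referred to by index. The seed is a hypothetical
    parameter added so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    orders = {}
    for evaluator in evaluators:
        # Assumption: each evaluator sees the issues in their own random order.
        issue_order = rng.sample(range(n_issues), n_issues)
        # Each issue gets a fresh shuffle of the heuristics. Independent shuffles
        # are not strictly guaranteed to differ, but a repeat is vanishingly
        # unlikely with 146 heuristics.
        orders[evaluator] = [
            (issue, rng.sample(range(n_heuristics), n_heuristics))
            for issue in issue_order
        ]
    return orders


orders = presentation_orders(88, 146, ["A", "B", "C"])
```

Drawing a new shuffle for every evaluator and issue means no heuristic is systematically seen first or last, which is the order effect the counterbalancing is meant to prevent.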
4.3.1 Results
All ratings were inspected for variance between the three evaluators. Extreme variances were frequently identified, for example when one evaluator rated a heuristic as 5 (“Complete explanation of why this is a problem”) and the other two evaluators rated it as 0 (“Does not explain the problem at all”). Krippendorff’s Alpha was computed across all of the ratings using an online calculator (Freelon, 2008), giving a value of 0.343 (nCoders = 3; nCases = 12848; nDecisions = 38544), which represents very poor reliability. The computation was repeated with an SPSS macro from Hayes and Krippendorff (2007), and produced the same value.
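For readers unfamiliar with the statistic, Krippendorff’s Alpha is defined as one minus the ratio of observed to expected disagreement. The following is a minimal sketch of that computation from a coincidence matrix, not the calculator or macro used in this study. It assumes the ordinal distance metric, which seems appropriate for a 0 to 5 rating scale, although the level of measurement selected in the original tools is not stated here. The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_ordinal(units, categories):
    """Krippendorff's alpha with the ordinal metric (assumed, see lead-in).

    units: one list of ratings per case (here, per issue-heuristic pair),
           with one rating per evaluator.
    categories: all scale values in rank order; every rating must be one of them.
    """
    # Coincidence matrix: each ordered pair of ratings within a unit adds
    # 1 / (m - 1), where m is the number of ratings in that unit.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating carries no agreement information
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = {c: sum(coincidences[(c, k)] for k in categories) for c in categories}
    n = sum(totals.values())
    position = {c: i for i, c in enumerate(categories)}

    def delta_sq(c, k):
        # Ordinal distance: frequencies of all ranks from c to k inclusive,
        # minus half of the two end-point frequencies, squared.
        lo, hi = sorted((position[c], position[k]))
        between = sum(totals[g] for g in categories[lo:hi + 1])
        return (between - (totals[c] + totals[k]) / 2.0) ** 2

    observed = sum(coincidences[(c, k)] * delta_sq(c, k)
                   for c in categories for k in categories)
    expected = sum(totals[c] * totals[k] * delta_sq(c, k)
                   for c in categories for k in categories)
    if expected == 0:
        return float("nan")  # undefined when every rating is the same value
    return 1.0 - (n - 1) * observed / expected


# Toy data (hypothetical): three cases, three evaluators, scale 0 to 5.
scale = list(range(6))
toy_alpha = krippendorff_alpha_ordinal([[5, 0, 0], [0, 0, 0], [3, 3, 4]], scale)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so a split such as the 5, 0, 0 pattern described above pulls the statistic down sharply because the ordinal metric penalises ratings that are far apart on the scale.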
It is noteworthy that in a study reported by Cockton et al. (2004), only 31% of heuristics were considered appropriately assigned. The alpha found in this present chapter is indicative of a similarly low level of appropriateness. In their later studies which employed structured reporting formats, Cockton et al. found this level increased to 60%. Similar results were reported for the percentage of problems predicted that were discovered during user testing.
4.3.2 Discussion
The systematically low levels of inter-evaluator reliability suggest fundamental inadequacies in a weakly specified, discount heuristic evaluation method when applied to the more complex scenarios in a video game user test. The original evaluation teams of the six heuristic sets considered here achieved agreement in their own studies through private discussion during evaluation. Unless the decisions made during those discussions are instantiated as formal, objective evaluation processes in the methodology, repeatability and validation of their results are not possible.
One possible reason for the ambiguity in interpretation is the different phrasing used in each heuristic set. In particular there appears to be a blurring between design guidelines and heuristics. There are a large number of design guidelines for video games, with much literature on the subject (Bateman and Boon, 2005; Fabricatore et al., 2002; Falstein and Barwood, 2006). Many of these guidelines, however, are tentative, subjective and informal. A reliable form of evaluation needs to be more rigorous, measurable, actionable, and based on empirical data.
Nielsen talks about heuristic evaluation as a discount method for cases where user testing is not required, for straightforward productivity applications such as telephony and textual interfaces (Nielsen, 1992). The assumption that it is a reliable discount method needs to be considered in more detail for the case of complex video game systems.
This study has suggested that heuristic evaluation is highly subjective and subject to a substantial degree of error, and that more rigorous techniques for the assessment of usability issues are needed in order to ensure this evaluation method produces valid, repeatable and valuable results. At present, heuristic evaluation could not be reliably validated for typical first-person shooter console video games.
The following sections explore the rating data from this study in order to understand why the evaluators assigned such different values to one another. The purpose of the investigation is to expose the implicit processes used by evaluators, and see whether there is a more reliable way to make use of the design and evaluation knowledge contained in the heuristics. Two different approaches are used. First, Section 4.4 (Validating Evaluation Themes) quantitatively