Interrater Reliability - Reading Statistics Huck

Researchers sometimes collect data by having raters evaluate a set of objects, pictures, applicants, or whatever. To quantify the degree of consistency among the raters, the researcher computes an index of interrater reliability. Five popular procedures for doing this include a percentage-agreement measure, Pearson’s correlation, Kendall’s coefficient of concordance, Cohen’s kappa, and the intraclass correlation.

The simplest measure of interrater reliability involves nothing more than a percentage of the occasions where the raters agree in the ratings they assign to whatever is being rated. In Excerpt 4.10, we see an example of this approach to interrater reliability. This excerpt is instructional because it contains a clear explanation of how the two reliability figures were computed.

EXCERPT 4.10

• Percentage Agreement as Measure of Interrater Reliability

An agreement was recorded if both observers identically scored the answer as correct or incorrect. A disagreement was recorded if questions were not scored identically.

Percent agreement for each probe was calculated by dividing the number of agreements by the number of agreements plus disagreements and multiplying by 100.

Interrater reliability for the first dependent variable was 94.6%. . . . Interrater reliability for the second dependent variable was 97.7%.

Source: Mazzotti, V. L., Wood, C. L., Test, D. W., & Fowler, C. H. (2010). Effects of computer-assisted instruction on students’ knowledge of the self-determined learning model of instruction and disruptive behavior. Journal of Special Education, in press.
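The calculation described in Excerpt 4.10 can be sketched in a few lines of Python. The function name and the two lists of scores below are hypothetical and serve only to illustrate the arithmetic; they are not data from the study.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items that two raters scored identically."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    disagreements = len(rater_a) - agreements
    # Agreements divided by (agreements + disagreements), multiplied by 100.
    return 100 * agreements / (agreements + disagreements)


# Hypothetical scores: 1 = answer scored correct, 0 = answer scored incorrect.
observer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
observer_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

agreement = percent_agreement(observer_1, observer_2)
print(agreement)  # 90.0 (the observers differ on 1 of the 10 answers)
```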

EXCERPT 4.11

• Using Pearson’s r to Assess Interrater Reliability

Two high school English teachers, blind to students and to the study’s hypothesis, rated each paper independently [on a 7-point Likert scale]. . . . Interobserver agreement for holistic quality for the essays (using Pearson r) was .95.

Source: Jacobson, L. T., & Reid, R. (2010). Improving the persuasive essay writing of high school students with ADHD. Exceptional Children, 76(2), 156–174.

The second method for quantifying interrater reliability uses Pearson’s product-moment correlation. Whereas the percentage-agreement procedures can be used with data that are categorical, ranks, or raw scores, Pearson’s procedure can be used only when the raters’ ratings are raw scores. In Excerpt 4.11, we see an example of Pearson’s correlation being used to assess the interrater reliability between two raters.
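As a rough illustration of this second method, the sketch below computes Pearson’s r for two raters from its definitional formula. The ratings are hypothetical 7-point scores, not the data behind Excerpt 4.11, and the function name is my own.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation between two raters' raw scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical ratings of eight essays on a 7-point scale.
teacher_1 = [1, 3, 4, 2, 5, 6, 3, 4]
teacher_2 = [2, 4, 5, 3, 6, 7, 4, 5]  # always one point higher than teacher_1

r = pearson_r(teacher_1, teacher_2)
print(r)  # 1.0
```

In this contrived case r equals 1.0 even though the two teachers never assign the same score, because Pearson’s r reflects how consistently the raters order the essays rather than whether their scores match exactly.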

Kendall’s procedure is appropriate for situations where each rater is asked to rank the things being evaluated. If these ranks turn out to be in complete agreement across the various evaluators, then the coefficient of concordance will turn out equal to 1.00. To the extent that the evaluators disagree with one another, Kendall’s procedure will yield a smaller value. In Excerpt 4.12, we see a case in which Kendall’s coefficient of concordance was used.
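A minimal sketch of Kendall’s coefficient of concordance follows. It uses the standard formula W = 12S / (m²(n³ − n)), where m is the number of raters, n is the number of things ranked, and S is the sum of squared deviations of the rank totals from their mean. The sketch assumes there are no tied ranks, and the rankings are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's W for m raters who each rank the same n items (no ties)."""
    m = len(rankings)
    n = len(rankings[0])
    for ranks in rankings:
        if sorted(ranks) != list(range(1, n + 1)):
            raise ValueError("each rater must use the ranks 1..n exactly once")
    # Total of the ranks each item received across all raters.
    rank_sums = [sum(ranks[i] for ranks in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of five applicants by three raters.
complete_agreement = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
some_disagreement = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [2, 1, 3, 5, 4]]

w_complete = kendalls_w(complete_agreement)
w_partial = kendalls_w(some_disagreement)
print(w_complete)           # 1.0
print(round(w_partial, 3))  # 0.911
```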

Kendall’s coefficient of concordance establishes how much interrater reliability exists among ranked data. Cohen’s kappa accomplishes the same purpose when the data are nominal (i.e., categorical) in nature. In other words, kappa is designed for situations where raters classify the items being rated into discrete categories. If all raters agree that a particular item belongs in a given category, and if there is total agreement for all items being evaluated (even though different items end up in different categories), then kappa assumes the value of 1.00. To the extent that raters disagree, kappa assumes a smaller value.
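The sketch below shows one way kappa can be computed for two raters. It relies on the standard definition, kappa = (p_o − p_e) / (1 − p_e), in which p_o is the observed proportion of agreement and p_e is the proportion of agreement expected by chance from each rater’s category frequencies. The category labels and classifications are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who sort the same items into categories."""
    n = len(rater_a)
    if n != len(rater_b) or n == 0:
        raise ValueError("both raters must classify the same non-empty set of items")
    # Observed proportion of items placed in the same category by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, based on how often each rater used each category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if p_chance == 1:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical classifications of ten Web pages: E = explicit, F = figurative.
rater_1 = ["E", "E", "E", "E", "E", "E", "F", "F", "F", "F"]
rater_2 = ["E", "E", "E", "E", "E", "F", "F", "F", "F", "E"]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))  # 0.583
```

Here the raters agree on 8 of the 10 pages (80%), yet kappa is only about .58, because kappa discounts the agreement that would be expected by chance alone.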

To see a case in which Cohen’s kappa was used, consider Excerpt 4.13. In the study that provided this excerpt, the researchers examined 225 social networking Web pages created by 17- to 20-year-olds. Each had a reference to alcohol. Two raters classified each Web page in terms of preset categories (e.g., explicit or figurative reference to alcohol). These categories were fully nominal in nature. Cohen’s kappa was used to see how consistently the two raters assigned the Web pages to the categories.

EXCERPT 4.12

• Kendall’s Coefﬁcient of Concordance

A standardized measure of neurological dysfunction specifically designed for TBI currently does not exist and the lack of assessment of this domain represents a substantial gap. To address this, the Neurological Outcome Scale for Traumatic Brain Injury (NOS-TBI) was developed. . . . Overall interrater agreement between independent raters (Kendall’s Coefficient of Concordance) for the NOS-TBI total score was excellent (W = .995).

Source: McCauley, S. R., Wilde, E. A., Kelly, T. M., Weyand, A. M., Yallampalli, R., Waldron, E. J., et al. (2010). The Neurological Outcome Scale for Traumatic Brain Injury (NOS-TBI): II. Reliability and convergent validity. Journal of Neurotrauma, 27(6), 991–997.

EXCERPT 4.13

• Cohen’s Kappa

We evaluated 400 randomly selected public MySpace profiles of self-reported 17- to 20-year-olds. . . . Two authors (L.R.B. and M.M.) conducted the initial evaluation to identify profiles with alcohol references. . . . 225 profiles contained references to alcohol and were included in all analyses (56.3%). . . . Cohen’s Kappa statistic was used to evaluate the extent to which there was agreement in the coding of the web profiles before discussion [resolved differences of opinion]. The Kappa value for the identification of references to alcohol use was 0.82.

Source: Moreno, M. M., Briner, L. R., Williams, A., Brockman, L., Walker, L., & Christakis, D. A. (2010). A content analysis of displayed alcohol references on a social networking web site. Journal of Adolescent Health, 47(2), 168–175.

The final method for assessing interrater reliability to be considered here is called intraclass correlation (ICC), a multipurpose statistical procedure, as it can be used for either correlational or reliability purposes. Even if we restrict our thinking to reliability, ICC is still versatile. Earlier in this chapter, we saw a case where the intraclass correlation was used to estimate test–retest reliability. Now, we consider how ICC can be used to assess interrater reliability.

Intraclass correlation is similar to the other reliability procedures we have considered in terms of the core concept being dealt with (consistency), the theoretical limits of the data-based coefficient (0 to 1.00), and the desire on the part of the researcher to end up with a value as close to 1.00 as possible. It differs from the other reliability procedures in that there are several ICC procedures. The six most popular of these procedures are distinguished by two numbers put inside parentheses following the letters ICC. For example, ICC (3,1) designates one of the six most frequently used versions of intraclass correlation. The first number indicates which of three possible statistical models has been assumed by the researchers to underlie their data. The second number indicates whether the researchers are interested in the reliability of a single rater (or, one-time use of a measuring instrument) or in the reliability of the mean score provided by a group of raters (or, the mean value produced by using a measuring instrument more than once). The second number within the parentheses is a 1 for the first of these two cases; if interest lies in the reliability of means, the second number is a value greater than 1 that designates how many scores are averaged together to generate each mean.

I will not attempt to differentiate any further among the six main cases of ICC. Instead, I simply want to point out that researchers should explain in their research reports (1) which of the six ICC procedures was used and (2) the reason(s) behind the choice made. You have a right to expect clarity regarding these two issues because the ICC-estimated reliability coefficient can vary widely depending on which of the six available formulas is used to compute it.
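To give a sense of how widely the coefficient can vary, the sketch below computes the three single-rater versions, ICC (1,1), ICC (2,1), and ICC (3,1), from the same table of ratings. The formulas are the mean-square expressions given by Shrout and Fleiss (1979), from whom this two-number notation comes, and the ratings are the small six-target, four-judge illustration used in that article. The function and variable names are my own, and the code is meant as an illustration rather than a substitute for a statistics package.

```python
def icc_single_rater(ratings):
    """Return (ICC(1,1), ICC(2,1), ICC(3,1)) for a targets-by-raters table."""
    n = len(ratings)       # number of targets being rated (rows)
    k = len(ratings[0])    # number of raters (columns)
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way (targets x raters) layout.
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_targets = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_error = ss_total - ss_targets - ss_raters

    # Mean squares.
    ms_targets = ss_targets / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    ms_within = (ss_raters + ss_error) / (n * (k - 1))

    icc_1_1 = (ms_targets - ms_within) / (ms_targets + (k - 1) * ms_within)
    icc_2_1 = (ms_targets - ms_error) / (
        ms_targets + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    icc_3_1 = (ms_targets - ms_error) / (ms_targets + (k - 1) * ms_error)
    return icc_1_1, icc_2_1, icc_3_1


# Six targets (rows) rated by four judges (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc_1_1, icc_2_1, icc_3_1 = icc_single_rater(ratings)
print(round(icc_1_1, 2), round(icc_2_1, 2), round(icc_3_1, 2))  # 0.17 0.29 0.71
```

With these ratings, the same data yield coefficients ranging from .17 to .71, which is why a report that says only "ICC" leaves the reader unable to judge the reliability being claimed.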

In Excerpts 4.14 and 4.15, we see two examples where the intraclass correlation was used to assess interrater reliability. Notice that the researchers associated with the second of these excerpts indicate which of the six main types of ICC they used: model 2 for a single rater. Because the coefficient provided by ICC can vary widely depending on which of the six main formulas is used to obtain the intraclass correlation, we have a right to think more highly about the information in the second excerpt. It would have been even nicer if the authors of Excerpt 4.15 had explained why they chose ICC (2,1) instead of other variations of this reliability procedure.

EXCERPTS 4.14–4.15

• Intraclass Correlation

All participants wrote their personal vision, and [then] three raters rated the vision statements according to definitions of challenge and imagery. . . . In our study, interrater reliability was intraclass correlation coefficient (ICC) = .93 for challenging and ICC = .87 for imagery.

Source: Masuda, A. D., Kane, T. D., Stoptaugh, C. F., & Minor, K. A. (2010). The role of a vivid and challenging personal vision in goal hierarchies. Journal of Psychology, 144(3), 221–242.

(continued)
