The types of reliability described above are useful primarily for continuous measurements. When a measurement problem concerns categorical judgments, for instance classifying machine parts as acceptable or defective, measurements of agreement are more appropriate. For instance, we might want to evaluate the consistency of results from two different diagnostic tests for the presence or absence of disease. Or we might want to evaluate the consistency of results from three raters who are classifying classroom behavior as acceptable or unacceptable. In each case, each rater assigns a single score from a limited set of choices, and we are interested in how well these scores agree across the tests or raters.
Percent agreement is the simplest measure of agreement: it is calculated by dividing the number of cases in which the raters agreed by the total number of cases rated. In the example below (Table 1-1), percent agreement is (50 + 30)/100 or 0.80. A major disadvantage of simple percent agreement is that a high degree of agreement may be obtained simply by chance, and thus it is impossible to compare percent agreement across different situations where the distribution of data differs.
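To see why chance alone can produce high percent agreement, consider a minimal sketch. It assumes two raters who each label a case positive with a fixed probability, independently of one another; this model, the function name `chance_agreement`, and the example probabilities are illustrative assumptions, not taken from the text.

```python
def chance_agreement(p1_positive, p2_positive):
    """Expected proportion of cases on which two independent raters agree.

    Assumption (for illustration only): each rater labels a case positive
    with a fixed probability, independently of the other rater.
    """
    both_positive = p1_positive * p2_positive
    both_negative = (1 - p1_positive) * (1 - p2_positive)
    return both_positive + both_negative


# Balanced categories: raters guessing at random agree on about half the cases.
balanced = chance_agreement(0.5, 0.5)

# Skewed categories: raters who each call 90% of cases positive agree on
# most cases without exercising any judgment at all.
skewed = chance_agreement(0.9, 0.9)

print(f"balanced: {balanced:.2f}, skewed: {skewed:.2f}")
```

Under these assumptions, a percent agreement of 0.80 would be well above chance in the balanced setting but slightly below chance in the skewed one, which is why raw percentages are hard to compare across data sets with different distributions.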
This shortcoming can be overcome by using another common measure of agreement called Cohen’s kappa, or simply kappa, which was originally devised to compare two raters or tests and has been extended for larger numbers of raters. Kappa is preferable to percent agreement because it is corrected for agreement due to chance (although statisticians argue about how successful this correction really is: see the sidebar below for a brief introduction to the issues). Kappa is easily computed by sorting the responses into a symmetrical grid and performing calculations as indicated in Table 1-1. This hypothetical example concerns two tests for the presence (D+) or absence (D–) of disease.
Table 1-1. Agreement of two raters on a dichotomous outcome

                    Test 2
                    +       –       Total
    Test 1    +     50      10      60
              –     10      30      40
    Total           60      40      100

The four cells containing data are commonly identified as follows:

                    Test 2
                    +       –
    Test 1    +     a       b
              –     c       d
Cells a and d represent agreement (a contains the cases classified as having the disease by both tests, d contains the cases classified as not having the disease by both tests), while cells b and c represent disagreement.
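The calculation that the following paragraphs work through by hand can also be sketched in code. The block below is a minimal illustration, not a definitive implementation: the function names `kappa_2x2` and `landis_koch_label` are hypothetical, the cell arguments follow the a, b, c, d labels used above, and the verbal labels use the Landis and Koch (1977) cut-offs quoted later in this section. It does not handle the degenerate case in which expected agreement equals 1.

```python
def kappa_2x2(a, b, c, d):
    """Return (observed agreement, expected agreement, kappa) for a 2x2 table.

    a and d are the agreement cells; b and c are the disagreement cells
    (rows: Test 1, columns: Test 2).
    """
    n = a + b + c + d
    observed = (a + d) / n

    # Expected count in an agreement cell = row total * column total / n.
    expected_a = (a + b) * (a + c) / n
    expected_d = (c + d) * (b + d) / n
    expected = (expected_a + expected_d) / n

    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


def landis_koch_label(kappa):
    """Map a kappa value to the verbal labels of Landis and Koch (1977)."""
    if kappa < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return label
    return "Almost perfect"


# Cell counts from Table 1-1.
observed, expected, kappa = kappa_2x2(50, 10, 10, 30)
print(f"observed={observed:.2f} expected={expected:.2f} kappa={kappa:.3f}")
print(landis_koch_label(kappa))
```

For the counts in Table 1-1, this reproduces the percent agreement of 0.80 and yields an expected agreement of 0.52 and a kappa of about 0.583, which falls in the "Moderate" band.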
The formula for kappa is:

\[ \kappa = \frac{\rho_o - \rho_e}{1 - \rho_e} \]

where ρo = observed agreement and ρe = expected agreement.

ρo = (a + d)/(a + b + c + d), i.e., the number of cases in agreement divided by the total number of cases.

ρe = the expected agreement, which can be calculated in two steps. First, for cells a and d, find the expected number of cases in each cell by multiplying the row and column totals and dividing by the total number of cases. For a, this is (60 × 60)/100 or 36; for d it is (40 × 40)/100 or 16. Second, find expected agreement by adding the expected number of cases in these two cells and dividing by the total number of cases. Expected agreement is therefore:

ρe = (36 + 16)/100 = 0.52

Kappa may therefore be calculated as:

\[ \kappa = \frac{0.8 - 0.52}{1 - 0.52} = 0.583 \]

Kappa can range from –1 to 1, although in practice values usually fall between 0 and 1: the value would be 0 if observed agreement were the same as chance agreement, 1 if all cases were in agreement, and negative if observed agreement were worse than chance. There are no absolute standards by which to judge a particular kappa value as high or low; however, many researchers use the guidelines published by Landis and Koch (1977):

    < 0          Poor
    0–0.20       Slight
    0.21–0.40    Fair
    0.41–0.60    Moderate
    0.61–0.80    Substantial
    0.81–1.0     Almost perfect

Note that kappa is always less than or equal to the percent agreement because it is corrected for chance agreement.

For an alternative view of kappa (intended for more advanced statisticians), see the sidebar below.

Validity

Validity refers to how well a test or rating scale measures what it is supposed to measure. Some researchers define validation as the process of gathering evidence to support the types of inferences intended to be drawn from the measurements in question. Researchers disagree about how many types of validity there are, and scholarly consensus has varied over the years as different types of validity are subsumed under a single heading one year, then later separated and treated as distinct. To keep things simple, we will adhere to a commonly accepted categorization of validity that recognizes four types: content validity, construct validity, concurrent validity, and predictive validity, with the addition of face validity, which is closely related to content validity. These types of validity are discussed further in the context of research design in Chapter 5.
Content validity refers to how well the process of measurement reflects the important content of the domain of interest. It is particularly important when the purpose of the measurement is to draw inferences about a larger domain of interest. For instance, potential employees seeking jobs as computer programmers may be asked to complete an examination that requires them to write and interpret programs in the languages they will be using. Only limited content and programming competencies may be included on such an examination, relative to what may actually be required to be a professional programmer. However, if the subset of content and competencies is well chosen, the score on such an exam may be a good indication of the individual’s ability to contribute to the business as a programmer.
A closely related concept to content validity is known as face validity. A measure with good face validity appears, to a member of the general public or a typical person who may be evaluated, to be a fair assessment of the qualities under study. For instance, if students taking a classroom algebra test feel that the questions reflect what they have been studying in class, then the test has good face validity.