TEST RELIABILITY - The development of a criterion referenced test of occupational functional re

Introduction

The method of assessing a test’s reliability depends upon the use to which the test is to be put. If the designer is concerned with whether all the items measure the same thing, i.e. the homogeneity of a unidimensional test, he will use a measure of internal consistency or a split-halves technique. He might develop an equivalent form of the test by identical processes to see whether the two forms match in the responses of the same pupils to them. He might be concerned with decision-making and wish to assess the reliability of the test in suggesting the same decision on more than one occasion for the same testee, i.e. the stability of the test. Most criterion-referenced tests are concerned with making decisions about the attainments or abilities of individuals or about the efficacy of instructional procedures. In each case, the reliability of that decision is of paramount importance, and the test must contribute to that reliability by being consistent in its recommendations. The testees’ scores must tend to be the same on more than one occasion of testing; and this is true whether the test is one that classifies a testee as ‘master’ or ‘non-master’, or one that uses individual scores as part of a larger decision-making process. In fact, Swaminathan, Hambleton & Algina (1974) go as far as defining the reliability of a criterion-referenced test as "the measure of agreement between the decisions made in repeated test administrations". Whilst this is not the case for all criterion-referenced tests, it is sufficiently so for the majority of effort in the area of reliability theory to be concentrated on methods for most accurately assessing the true scores of individuals, and hence the reliability over time of the test. Almost uniformly, they involve the test-retest procedure.

Such procedures, however, expose the test materials to the testees on two occasions and are therefore prone to a practice effect, and to other testee-related effects (such as rejection of the testing ‘discipline’, where repeated testing causes the testee to falsify his responses deliberately in reaction to the task). Slower workers may reach items previously omitted due to lack of time; quicker workers may become careless. The small changes in scores that such effects are likely to bring about have particular importance when the range of scores is restricted. In the extreme case, where all high scorers were careless and lost one mark because of it and all low scorers gained one mark through practice, these changes would produce a low estimate of reliability, when in fact all testees had similar marks on both testing occasions.
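The extreme case described above can be illustrated numerically. The following is a minimal sketch only: the scores are hypothetical (they are not SOFRP data), and the helper `pearson_r` is simply the usual product-moment formula written out for the purpose.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical first-testing scores with a restricted range:
# three high scorers (19) and three low scorers (18).
first = [19, 19, 19, 18, 18, 18]

# High scorers lose one mark through carelessness;
# low scorers gain one mark through practice.
second = [s - 1 if s == 19 else s + 1 for s in first]

largest_change = max(abs(a - b) for a, b in zip(first, second))
r = pearson_r(first, second)

print("largest change in any score:", largest_change)
print("test-retest correlation:", r)
```

No testee's score moves by more than one mark, yet the correlation between the two testings is strongly negative, because the rank order within so narrow a range is reversed.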

A problem related to this, and following on directly from it, is that of homoscedasticity, where the test may have a different level of reliability for different ability ranges. For example, the slower worker mentioned above, whose score improves on retesting, contributes more to a lower estimate of reliability by changing his score than does a top-scoring testee whose score stays the same. Further, a test with a ceiling effect, with many scores clustering around the higher score range, may have different reliabilities for that cluster compared with a spread of scores lower down the score range. This may, of course, be a function of the variability of the two ranges: cluster with low variability and spread with high variability. Nonetheless, these aspects of the procedures for estimating test reliability must be borne in mind.

It is clearly of importance that the Functional Reading Tests, Form A and Form B (FRT A and FRT B), be reliable over time and that a test-retest assessment will be appropriate. It will not, however, be

are hierarchical in some way can a single score exactly state what the testee can and cannot do, which is a part of the definition of a criterion-referenced test (e.g. Nitko, 1970, p. 653). Davis suggests that the conditions for this to be exactly the case in any test are never met in actual practice (Davis, 1974, p. 45). One method of overcoming this difficulty is to report subtest scores within a heterogeneous test. Brown (1970) indicates that reliability coefficients are often reported for the summative score only, rather than for subtests as well. Ganopole (1978, p. 45) suggests, however, that "it is only when reliability estimates are reported for each subtest for which a score is derived, that we can have any assurance that the subtest possesses acceptable reliability". It will be worthwhile to compare and report subtest reliability also.

To pursue the argument a little further: in a heterogeneous criterion-referenced test comprising several subtests, is it worthwhile to report a summative score, rather than subtest scores? For, if the real information about a testee’s performance comes from the subtests, of what use is a single figure? Clearly, it is only of value if the subtest scores are themselves related to the total score, i.e. to get a high total score, each subtest score must be high also. If this is the case, the summative score will be an indicator of overall performance across subtests. This requires a measure of internal consistency, not of the item-total type (e.g. Kuder-Richardson 20), which is based on unidimensionality, but the establishment of an acceptable relationship between subtest and total scores. With such a measure, it becomes meaningful to talk about standards and cut-off scores. These are used frequently in criterion-referenced testing, particularly in mastery tests, where a common practice is to establish, by some means, a criterion score to determine ‘pass’ or ‘fail’ grades for testees. It was the intention of SOFRP to investigate criterion scores empirically, following the arguments against their arbitrary determination advanced by Glass and Smith (1978), discussed in Chapter 4 above, but, as much of the following discussion hinges on the use of criterion scores, it will be useful to bear in mind that their use remains a possibility at this stage. The empirical investigation is discussed in Chapters 14 and 15 below.

It will also be of value to assess the reliability of the estimate of a testee’s performance on functional reading tasks, i.e. the ‘domain status’ of the testee. This is distinct from the stability of the test: there we are considering the reliability of the assignment of the testee to one of two or more mastery states; here we are considering the confidence limits of the testee’s score on the test.

So, then, three measures of reliability are sought for FRT A and FRT B: a coefficient of stability, or consistency of decision-making, for summative scores and for subtest scores; an internal consistency measure of the relationship of subtest scores to summative score; and confidence limits for individual scores. It is to be noted that a high test-retest coefficient for the summative score will be of little value if there is a poor relationship between the total and the subtest scores.

Estimate of Criterion-referenced reliability

It is in the estimation of reliability that the spectre of low variability in scores looms most oppressively. Popham & Husek summarise the problem neatly: "Stability might certainly be important for a criterion-referenced test, but in that case, a test-retest correlation coefficient, dependent as it is on variability, is not necessarily the way to assess it ... If a criterion-referenced test has a high average inter-item correlation, that is fine. If the test has a high test-retest correlation, that also is fine. The point is not that these indices cannot be used to support the consistency of the test. The point is that a criterion-referenced test could be highly consistent, either internally or temporally, and yet indices dependent on variability might not reflect that consistency" (1969, pp. 5-6). Popham & Husek do not rule out the use of the usual product-moment

Jackson (1970) contends that in actual practice there will always be variability in scores, and so correlational techniques may be used. Yet in a situation where the measure itself may be unreliable, it is necessary to have other techniques to hand, to build a corpus of information upon which to base decisions. The use of the computer brings even the most complicated technique within easy grasp and there can be little excuse for not considering a range of possibilities.

Wedman (1974) suggests there are three schools of thought in the area of criterion-referenced reliability estimation: those applying classical techniques; those reformulating classical techniques for criterion-referenced measurement; and those advocating the standard error of measurement as a technique. A number of techniques are concerned with pre-instruction testing versus post-instruction testing (e.g. Ivens (1970); Cox and Graham (1966)) and, as this is not applicable to SOFRP, will not be considered here. Below, the latter two schools of thought will be discussed, along with other suggestions that have more recently arisen.

Livingston (1972a) reformulated certain norm-referenced concepts for use in criterion-referenced measurement by using deviations from the criterion score, rather than from the mean score, as the equivalent of variance. This enabled him to develop a special correlation coefficient, of product-moment type, with moments taken around the criterion score rather than the mean. Harris (1972), however, argues that Livingston’s coefficient may be inconsistent with different ranges of talent: "Livingston’s "bigger" coefficients (than classically derived ones) can readily be secured by implicitly extending the range of talent" (pp. 28-29). Further, he shows that the standard error of measurement for both techniques is the same and concludes that "his work fails to advance reliability theory for the special case of criterion-referenced testing" (p. 29). Although Livingston’s reply (1972b) answers many of Harris’ criticisms, Shavelson, Block and Ravitch (1972) suggest that Livingston’s coefficient is a function of the criterion as well as the individuals’ responses to items. They further conclude that his coefficient does not have the same meaning as the classical reliability measures and so should not be counted as such. Divgi (1978) notes, from Livingston’s equation for his index:

$$\frac{\operatorname{var}(T) + (\mu - C)^2}{\operatorname{var}(X) + (\mu - C)^2} = \frac{\rho + (\mu - C)^2 / \operatorname{var}(X)}{1 + (\mu - C)^2 / \operatorname{var}(X)}$$

(where T and X are true and observed scores respectively, μ is the mean score, C is the criterion (cut-off) score and ρ is the classical coefficient of reliability), that, "for any given value of ρ, Livingston’s index increases with the distance between the mean and criterion scores, and therefore it can have an appreciable value even when ρ = 0" (p. 3).
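Divgi’s observation can be checked directly from the right-hand form of the expression above. The sketch below is illustrative only: the function name `livingston_index` and all numerical values are assumptions introduced here, not figures from any of the studies cited.

```python
def livingston_index(rho, mean, criterion, var_x):
    """Right-hand form of the expression above:
    (rho + d) / (1 + d), where d = (mean - criterion)**2 / var_x.
    """
    if var_x <= 0:
        raise ValueError("observed-score variance must be positive")
    d = (mean - criterion) ** 2 / var_x
    return (rho + d) / (1 + d)

# Hypothetical case: classical reliability of zero, mean score 10,
# observed-score variance 4, and a criterion score moved steadily
# further from the mean.
for criterion in (10, 12, 14, 16):
    value = livingston_index(0.0, 10, criterion, 4.0)
    print("criterion", criterion, "index", round(value, 3))
```

With the criterion at the mean the index equals the classical coefficient; as the criterion moves away from the mean the index rises towards 1, even though the classical coefficient has been fixed at zero throughout.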

Other coefficients have been suggested or adapted for use in the assessment of criterion-referenced test reliability. Swaminathan, Hambleton & Algina (1974) proposed the use of Cohen’s κ (Cohen, 1960) as a measure of the agreement in the assignment of testees to mastery states, between two testings, corrected for chance agreement. κ is defined as

$$\kappa = \frac{p_o - p_c}{1 - p_c}$$

where $p_o$ is the observed proportion of agreement and $p_c$ is the expected chance proportion of agreement. For two test administrations to the same sample of testees, for any given cut-off score, a figure similar to Figure 12.1 is obtained.

                                        First Administration
                                  Master (M)   Non-Master (N)   Marginal Proportion
Second           Master (M)         p_MM          p_NM                p_2M
Administration   Non-Master (N)     p_MN          p_NN                p_2N
                 Marginal Prop.     p_1M          p_1N

Figure 12.1: Crosstabulation of Mastery proportions

where p is the proportion in each cell. Thus,

$$p_o = p_{MM} + p_{NN}$$

and

$$p_c = (p_{1M} \times p_{2M}) + (p_{1N} \times p_{2N})$$
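The calculation of κ from the proportions in Figure 12.1 may be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the scores and the cut-off score of 13 are all hypothetical, and a testee is taken to be a master when his score is at or above the cut-off.

```python
def classify(scores, cut):
    """Assumed mastery rule: a score at or above the cut-off is a master."""
    return [s >= cut for s in scores]

def kappa_from_classifications(first, second):
    """Cohen's kappa for master/non-master decisions on two administrations.

    `first` and `second` are equal-length lists of booleans (True = master).
    Returns None when chance agreement is 1, where kappa is undefined.
    """
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("need two non-empty lists of equal length")
    # Cell proportions on the agreement diagonal of the crosstabulation.
    p_mm = sum(a and b for a, b in zip(first, second)) / n
    p_nn = sum((not a) and (not b) for a, b in zip(first, second)) / n
    # Marginal proportions of masters on each administration.
    p_1m = sum(first) / n
    p_2m = sum(second) / n
    p_o = p_mm + p_nn
    p_c = p_1m * p_2m + (1 - p_1m) * (1 - p_2m)
    if p_c == 1:
        return None
    return (p_o - p_c) / (1 - p_c)

# Hypothetical scores for eight testees on test and retest.
first_scores = [12, 15, 9, 18, 14, 7, 16, 11]
second_scores = [13, 14, 10, 17, 12, 9, 16, 13]

kappa = kappa_from_classifications(classify(first_scores, 13),
                                   classify(second_scores, 13))
print("kappa at cut-off 13:", kappa)
```

Where every testee is classified the same way on both occasions (all masters, say), the chance proportion of agreement is 1 and the coefficient cannot be determined; the sketch returns `None` in that case rather than a number.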

Coefficient κ has much to recommend it: it is simple to calculate and is intuitively attractive, for it seems most reasonable that the proportion of agreement in classification should be in direct relation to the reliability of the test. There are a number of objections to its use, however, which should be considered. Reid & Roberts (1978) compared κ with the coefficient φ via a Monte Carlo procedure, for the dichotomous case, and concluded that, in practice, the difference between the coefficients is so small as to suggest that the more easily calculable (φ) be used instead. Cohen (1960) noted that this would tend to be the case when the marginal proportions of categories were similar for test and retest. Subkoviak (1978) used four different procedures for estimating the reliability of mastery tests. He notes that "the Swaminathan procedure (κ) produces unbiased estimates; but it requires two testings and standard errors are relatively large for classroom size samples" (p. 115). His comparison is with three measures all assuming item homogeneity (the Subkoviak (1976), Huynh (1976) and Marshall-Haertel (1976) estimates), and perhaps such a comparison is invidious for a procedure unbiased by the content in its estimations. Divgi (1978) writes that all the above measures are still dependent upon the distributions of ability of the groups tested, and suggests that, therefore, there is still reference to a normative model. He raises the point that if criterion-referenced measurement is to be group-independent, the measure of reliability should be group-independent also. His coefficient, k, is however only for homogeneous item groups and, perhaps, the probabilities of misclassification given in his Table 2 are unacceptably high.

Further, Swaminathan et al. themselves indicate that the coefficient is dependent upon factors that affect the decision process (1974, p. 266). Martuza (1977) writes that "$p_o$ and κ are dependent on a number of factors associated with the decision process e.g. the value of the cut score (i.e. the score used to classify examinees as masters and non-masters), test length ... the number of alternatives per item if multiple choice items are used, and the homogeneity of the examinee group in which the decisions concerning mastery and non-mastery are being made. As a result, the values of $p_o$ and κ are interpretable only when information concerning these factors is available" (pp. 280-281).

Despite the above criticisms and qualifications, the use of κ seems straightforward and appropriate to the circumstances, and its adoption, with an open mind, might prove worthwhile. Studies involving

Millman (1974) has suggested a method for calculating the reliability of estimates of domain status using the mean absolute difference between scores on two identically generated tests. This would be inappropriate for a test-retest situation, however, where one might reasonably expect some practice effect to be apparent, increasing the size of the mean absolute difference. It might be argued, however, that there is a difference between testees who answer incorrectly on the first occasion and change to correct on the second, compared with those who change from correct to incorrect. A practice effect, in that it gives pupils a further chance to assess the possible responses to an item, may be assumed to act to cause a group shift towards ‘correctness’ rather than ‘incorrectness’, for a reliable test. With a test which is inconsistent (due, perhaps, to ambiguity in some items), on the other hand, one might expect any extra time to be just as likely, if not more so, to act to cause a shift to ‘incorrectness’ as to ‘correctness’. It will therefore be instructive to examine any shift to ‘incorrectness’ via a count, for each item or over the whole test, of the number of testees changing from correct on the first administration to incorrect on the second. The lower the count, the happier one can be that strange changes are not taking place within the test (cf. also Beggs & Lewis, 1975, p. 200).
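The objection to using the mean absolute difference in a test-retest design can be seen in a small worked case. The sketch below is an assumption-laden illustration: the function name and the scores are hypothetical, and it computes only the plain mean absolute difference between paired scores, not any further refinement Millman may have proposed.

```python
def mean_absolute_difference(first, second):
    """Mean of |first - second| over paired scores for the same testees."""
    if not first or len(first) != len(second):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)

# Hypothetical scores: every testee gains exactly one mark on retest,
# as a uniform practice effect would produce.
first_scores = [10, 12, 14, 16]
second_scores = [s + 1 for s in first_scores]

mad = mean_absolute_difference(first_scores, second_scores)
print("mean absolute difference:", mad)
```

The testees keep exactly the same order and spacing on both occasions, yet the mean absolute difference is a full mark, all of it attributable to practice rather than to inconsistency in the test.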

Other estimates of domain status have used the standard error of measurement, following classical procedures. As Hambleton, Swaminathan & Algina (1976) point out: "Whereas classical approaches to reliability estimation are affected adversely by the homogeneity of scores often obtained with criterion-referenced tests, the standard error of measurement is relatively unaffected" (p. 58), and can thus be used as an estimate of the amount of error associated with the domain score. Shavelson et al. (1972) and Kriewall (1972) both recommend this approach. They seem, however, to make no allowance for severely skewed distributions, where neither assumptions of approximate normality nor of normal or binomial distributions of error may hold. If one is taking, as SOFRP does, the proportion-correct score as the best estimate of a testee’s level of performance, it may prove unjustifiable to employ this method if the distribution of scores is skewed.
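For reference, the classical standard error of measurement is the observed-score standard deviation multiplied by the square root of one minus the reliability coefficient. The sketch below assumes that classical form and a normal-theory band around the observed score; the authors cited may have used other formulations, and the function names and figures are hypothetical. It is exactly this normal-theory assumption that the preceding paragraph calls into question for skewed distributions.

```python
from math import sqrt

def standard_error_of_measurement(sd_observed, reliability):
    """Classical SEM: sd_observed * sqrt(1 - reliability)."""
    if sd_observed < 0 or not 0 <= reliability <= 1:
        raise ValueError("need sd >= 0 and 0 <= reliability <= 1")
    return sd_observed * sqrt(1 - reliability)

def confidence_limits(score, sem, z=1.96):
    """Normal-theory limits around an observed score (z=1.96 for about 95%).

    Assumes approximately normal errors, which may not hold when the
    score distribution is severely skewed.
    """
    return (score - z * sem, score + z * sem)

# Hypothetical values: standard deviation of 5 marks, reliability of 0.84.
sem = standard_error_of_measurement(5.0, 0.84)
low, high = confidence_limits(30, sem)
print("SEM:", round(sem, 2))
print("limits for a score of 30:", round(low, 2), "to", round(high, 2))
```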

From this discussion of estimates of reliability in criterion-referenced measurement, one may select the methods which seem most applicable to SOFRP. Firstly, there can be no harm in calculating a classical reliability coefficient, for if it is high, it still indicates a reliable test (Popham & Husek, 1969, previously quoted on this matter). As a method of looking at the amount of agreement, independent of the variability of scores, coefficient κ may well prove useful, although its high rate of indeterminacy may mean no productive value is obtained. For such purposes, κ will be calculated for each cutting score, for the whole test and each subtest. Thus, no a priori determination of a cutting score need be made until after empirical investigation, but the data will be available. As a measure of reliability based on individual scores, rather than proportions above and below a cutting score, an adaptation of Millman’s procedure will be used. For each test item, the number of changes in response (‘correct’ to ‘incorrect’ and vice versa) will be tallied and these ‘Discrepancies’ reported. The change from ‘correct’ to ‘incorrect’ will be considered as most important. A high proportion of changes in this direction will indicate an unreliable test. The calculation is performed for items, rather than for testees whose score declines, as this gives an indication of which items, if any, are producing a shift to ‘incorrectness’.
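The item-by-item tally of ‘Discrepancies’ described above may be sketched as follows. The data layout (one row of correct/incorrect responses per testee, in the same testee and item order on both administrations), the function name and the responses themselves are assumptions made for illustration.

```python
def item_discrepancies(first, second):
    """Tally response changes per item between two administrations.

    `first` and `second` are lists of rows, one row per testee, each row
    a list of booleans (True = correct) in the same item order.
    Returns one dict per item with counts of changes in each direction.
    """
    if not first or len(first) != len(second):
        raise ValueError("need the same non-empty set of testees twice")
    n_items = len(first[0])
    tallies = [{"correct_to_incorrect": 0, "incorrect_to_correct": 0}
               for _ in range(n_items)]
    for row_1, row_2 in zip(first, second):
        for item, (a, b) in enumerate(zip(row_1, row_2)):
            if a and not b:
                tallies[item]["correct_to_incorrect"] += 1
            elif b and not a:
                tallies[item]["incorrect_to_correct"] += 1
    return tallies

# Hypothetical responses: four testees, three items.
T, F = True, False
first_responses = [[T, T, F], [T, F, F], [T, T, T], [F, T, F]]
second_responses = [[T, F, T], [T, T, F], [T, F, T], [T, T, T]]

tallies = item_discrepancies(first_responses, second_responses)
for number, tally in enumerate(tallies, start=1):
    print("item", number, tally)
```

In this hypothetical case only the second item shows any shift to ‘incorrectness’, and it would be that item which called for closer inspection.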

Assessment of FRT reliability

A test-retest procedure was used to assess the reliability of the two forms of the Functional Reading Test.

Form A (FRT A) was administered to 47 pupils in the first pilot school on two occasions separated by a three-week interval.

Form B (FRT B) was administered to 44 pupils in a second pilot school on two occasions separated by a three-week interval. This second school was selected after consultation with the Senior Advisor for Research and Evaluation of the Local Education Authority, and was considered to be very similar to the first pilot school in terms of catchment.

Reliability of FRT A

Test-retest results are given in Tables 12.1, 12.2, 12.3, 12.4 and 12.5.

The whole-test reliability of r = 0.86 (Table 12.1) is certainly sufficiently high to indicate that FRT A is reliable in classical terms. Further, all but one of the subtests also have

In document The development of a criterion referenced test of occupational functional reading ability (Page 122-140)