Estimation of Reliability Coefficients Using the Test Information Function and Its Modifications

(1)

229

Information

Function

and

Its

Modifications

Fumiko

_Samajima

University

of Tennessee

The

_reliability

coefficient and the standard error of

measurement in classical test

_theory

are not _properties

of

_{a specific}

test, but are attributed to both a

specific

test and a

_specific

trait distribution. In latent trait

mod-els, or _{item response}

_theory,

the test information

func-tion

_{(TIF) provides}

more

_precise

local measures of accuracy in trait estimation than are available from the

reliability

coefficient. The

_reliability

coefficient is still

widely

used, however, and is

_popular

because of its

simplicity.

Thus, it is worthwhile to relate it to the TIF.

In this paper, the

reliability

coefficient is

predicted

from the TIF, or two modified TIF _formulas,and _a spe-cific trait distribution.

_Examples

demonstrate the

vari-ability of

the

reliability

coefficient across different trait distributions, and the results are

_compared

with

empiri-cal

_reliability

coefficients. Practical

_suggestions

are

given

as to how to make better use of the

_reliability

coefficient. Index terms:

_adaptive

_testing,bias, clas-sical test

_{theory, item information function,}

latent trait

models, maximum likelihood estimation,

reliability

co-efficieno, standard

error

_{of measurement,}

test

informa-tion

_{function, trait}

estimation.

Reliability

and

_validity

coefficients have been

_{widely accepted by psychologists}

and test users as

im-portant concepts

in classical test

_theory

_(CTT).

The

_reliability

coefficient

_{(~°x~2), where X~}

is the test score

and X,

is the retest score, can be

_expressed

as the ratio of the true score variance to the observed test score

variance. _~is

_largely

influenced

_by

the

_homogeneity

or

_{heterogeneity}

of the _groupof examinees to

whom the test was administered as well as

_properties

of the test. In

_spite

of

_this,

many researchers and test

users still treat

_r~~2

as if it were an attribute

_solely

of the test.

In latent trait

models,

or item response

_theory

_(iRT),

the item information function

_(IIF)

and the test

information function

_{(TIF) provide}

measures of the local _accuracyof trait

_estimation,

a

_concept

that is

miss-ing

in CTT. The values of the IIF and the TIF do not

_depend

on the

_specific

_groupof examinees

_tested,

unlike

r,,,, (i.e.,

the IIF and TIF are

_{population-free).}

_Therefore,

_{~°X~2 and the associated standard}error of

measure-ment

_(SEM)

are not as

_important

as

_they

were before IRT

became

feasible.

_However,

because of its

_simplicity

~-y

is still

Thus,

it is worthwhile to relate it to the _TIF,and to

_suggest

better _ways

to use

~X ~2.

L,a~l~y

(1943)

related the normal

_ogive

model and _~XX.He showed _{that r~x%}is obtained from the

1 ~ 1 2

estimated error score variance and the observed test score variance when the trait distribution is

_N(0,1)

and the

_difficulty

_parameters

of the n dichotomous test items distribute

_normally.

Lord

₍₁₉₅₂₎

showed how the discrimination index of a test in CTT at a

_given

trait level

_(which

is

_closely

related to the amount

of test

_information)

is related to _rxx,at a

_specified

trait

_level,

when the

_regression

of test score on 0 level

is

_{approximately}

linear.

Samejima

(1977b)

showed that rlx, can be obtained from the TIF and the trait distribution of a

_target

_population.

Lord

₍₁₉₈₃₎

discussed unbiased estimators of the

_parallel

forms

_rX~2

in the

_{three-parameter logistic}

model. The idea of

_replacing

the CTT

_concepts

of _~°~~and the SEM

_by

the TIF

in IRT is advanced

_by

the

_proposal

of two modified TIF formulas

_(Samejima,

1~990),

which use the bias

(2)

The

_present

_paperuses information from the MLE bias function

(MLEBF)

of a test to

_predict

the

_rX¡X2

and

SEM attributed to a

_specified

examinee _groupto whom a test is to be administered. The results of

_predicting

TX¡X2

using

the

original

TIF and the two modified versions of the TIF also are

_compared.

The TIF and Local Standard Error of Estimation

Let 0 be the latent trait that takes on _anyreal number. Assume that there is a set of n test items

_measuring

0 whose characteristics are known.

Let g

denote such an

_{item, kg}

be a discrete response to item g,

and fl (0)

denote the

_operating

characteristic of

_k~9

or the conditional

_probability

_assigned

to

_Ag,

_given

8;

that

_is,

k,

Assume that

_~(8)

is at least five-times differentiable with _respectto 0. _{The item response information}

function

_{(S~rnejirna9 1972)}

is defined as

and the IIF,

_~(9),

is defined as the conditional

_expectation

_{of ~ (0),}

_{given 8,}

such that

When item _{g is}scored

_{dichotomously,}

the IIF is

_simplified

to

where

_{P (8)}

is the

_operating

characteristic of the correct answer to item _g.This is identical to the item information function

_{proposed by}

~irnba~~m

₍₁₉₆₈₎

_{for the dichotomous response item. Let}V be a

re-sponse

pattern such

that

The

_operating

characteristic,

P,(O),

of the response pattern V is defined as the conditional

_probability

of

V,

given

0,

and

_assuming

local

_{independence (Lord &}

_Nwi~k9

_1968),

The _response

_pattern

information function

_{(Samejima, 1972),}

_{7y(e), is}

_given

_by

The _TIF,

_1(0),

is defined as the conditional

_expectation

_{of 1,,(0),}

_given

_0,

and from

_Equations

_{2, 3, 6,}

and 7,

Again,

the

_relationship

between the IIF and the TIF demonstrated in

_{Equation 8}

is identical to the result Birnbaum

₍₁₉₆₈₎

demonstrated at the dichotomous response level.

The

_reciprocal

_{of the square}root of the TIP,

[7(9)]&dquo;~,

is the

_{asymptotic standard}

deviation of the condi-tional distribution of the MLE of

_0,

_given

its true value. This function

_usually

is used as the standard error

of estimation

_{(SEE) even}

when the number of test items is finite and

_{relatively small.}

Note that the SEE is a function of 0-it is

_locally

defined.

Also,

unlike its

_counterpart

in CTT, the SEE does not

_depend

on a

(3)

specific

group of

examinees,

but is

_solely

a

_property

of the test.

In CTT, the SEE

_gives

the

_impression

that the test

_always

_provides

the extent of error indicated

_by

its value.

Common sense

_{suggests, however,}

that no test can be

_appropriate

for _{every group of examinees. For}

ex-ample,

if a

_simple

arithmetic addition test is administered to

_college

mathematics

_majors,

almost _everyone will

_correctly

answer all the items. In such a case, the test is

_useless,

because the results do not reflect the individual differences _{among the}examinees. If a calculus test is administered to

_elementary

school

_students,

opposite-but

similar-results will occur. Thus each test is effective

_{only locally}

on the 0 dimension. The SEM

of the test should

differ,

therefore,

for _groupsof examinees at different locations on the 0 scale. It is more

appropriate

to consider the error of measurement as a function of

_0,

as IRT models do.

Prediction of the

_e9I~~~II~y

Coefficient the SEM for a

_Specifle

0 Distribution

_Using

the TIF

Using

the TIF, it is

_possible

to link CTT with IRT

_through

the

_prediction

_{of r°x X,}and the SEM for

_{a specified}

8 distribution or a

_specified

_groupof examinees

_{(Samejima, 1977b).}

’ ’

Let

_at

be any estimator of

0,

were £ denotes the error variable. In the test-retest

_situation,

where a

_subscript

1 indicates the test, and a

_subscript

2 indicates the retest. If it can be assumed that in the

test-retest

_situation,

and

then

where

_0*

_representseither

_atl

or

Q~,

and e _representseither _£1or _E,.

Thus,

the correlation between

_8tl

and

_8tz

is

Note that if 8 is

_replaced

_by

one of its transformed forms

_(i.e.,

the true test score

_~’),

and if the observed

test score X is used as the estimator of

T’,

and E is used as its error of

estimation,

then

_Equation

9 can be rewritten as

then

_Equation

15 becomes a familiar

_equation

for the

_reliability

coefficient _fy

^,9

In

_general,

(4)

reduced to

When the MLE of 0 is

_used,

let

_8y

or 6

denote the MLE of 0 based on the response

_pattern

V.

Samejima

(1977a,

1979)

observed that even with a

_relatively

small number of test items the conditional distribution of

_6,

given

0,

can be

_{approximated by}

the normal distribution

_N{8,[7(6)]~}

if two conditions hold. Condition 1

is

that 11

must be

_practically

_{conditionally}

unbiased for the 8 interval of interest. Condition 2 is that

1(0)

must assume

_{reasonably high}

values for that

_specific

interval. The best

_{approximation}

occurs when this

interval covers the _rangeof 8 within which most examinees are located. When this is the case, from

Equation

19,

where y(e)

denotes the

_density

function of 6 for

_{a specific}

_groupof examinees.

Thus,

from

_Equation

_15,

Equation

21 indicates that the

_reliability

coefficient

)r(6j, #~))

can be

_predicted

by a

single

administration of the test,

given 1(0)

and the 0 distribution of the examinees.

A

.

It also has been observed

_(Samejima,

1977b)

that in

_{computerized adaptive}

_testing

_(CAT),

~°(®h ®2~

can

be

_predicted

if a

_specified

amount of test information is used as the

_stopping

rule for

_{a given}

0 level in the

test and retest situations.

_Thus,

where

_7~(6)

and

_~(0)

are the _presetcriterion Tips in the test and retest

_situations,

_{respectively,}

which are

adopted

as the

_stopping

rules for the two

_testing

sessions

_{(a subscript}

1 indicates the test

_situation,

and a

subscript

2 indicates the retest

_situation).

Note that these two criterion Tips need not be the same for the

test _{and the retest,}nor do

_they

need to be constant for all 0. Note also that

_{r(6p 6j}

is obtainable from a

single

test

_{administration,}

because all that is needed is the

_sample

variance

_{of 0,}

to

_replace

Var(ej

in

Equation

22.

_t-(61,62)

will

_change

if the

_preset

criterion TiFs

_[/~(8),

_I~z~(~)9

or

_both]

_change.

For the

simpli-fied case in which the same amount of test information is used as the criterion for

_terminating

the

presen-tation of new _{items for every examinee in each of the}test and retest

_situations,

_{respectively, Equation}

22

can be rewritten as

where

_U2

and

_~2

are the

reciprocals

of the constant amounts of criterion test information in the two

_testing

situations,

respectively.

If a constant amount of test information is used as the

_stopping

rule for every

examinee in both the test and retest

situations,

then

_r(6,,

_6,)

takes the

simplest

form where

CY2

denotes the

_reciprocal

of this common constant amount of test information.

The

_{appropriateness}

of the above normal

_{approximation}

of the conditional distribution of

6,

given

0,

can be examined

_using

the monte carlo method

_{(see Samejima, 1977a).}

A _necessarycondition for this

approximation

is that

6

is

_practically

_{conditionally unbiased,}

that

is,

the

_regression

of

6

on 0 is _veryclose

(5)

intro-duced below. Note that the MLEBF

_together

with the 0 distribution of the

_target

_population

also determines whether the

_assumption

described

_{by Equation}

13 can be

_accepted.

Because the SEM in CTT is for the _{entire group of}

_examinees,

the SEM, Og, of a test attributed to a

_specific

0 distribution can be written as

when Conditions 1 and 2 described above hold.

Assume that the test items have

_already

been

calibrated;

that

is, the

item

_parameters

of each item have been estimated

_following

an

appropriate

mathematical model for

~(8).

This can be done either

separately

.from, or

simultaneously

with,

maximum likelihood estimation of the individual

_parameter

0 for a

nonadaptive

test. In the former

_situation,

_8y

will be obtained for _everyindividual

_{by using}

the estimated item

_parameters.

In CAT, the item

parameters

for the items in the item

_{pool usually}

have been estimated before

_using

the item

_pool

for

_{adaptive testing; using}

these estimated item

_parameters

the individual

parameter

8y

is obtained as a result of

_testing,

which is based on a tailored subset of the item

_pool

for each

examinee.

_Thus,

_{Var(6) and Var(9J in}

_Equations

21 or 22 can be

_computed.

From

_Equations

_{2, 3,}

and

_8,

₁₍₀₎

for a

_nonadaptive

test can be estimated

_using

the estimated item

parameters. The

_{density function,}

_f(O),

can either be estimated

_{(e.g., Samejima, 1981)}

or set a

_{priori if}

prediction

is the purpose, as it is here.

_Thus,

E{[/(6)]’’}

is obtained with

₁₍₀₎

_{replaced by}

the estimated TIP

in

_Equation

_20,

_{approximating}

the area

_by

a number of

_rectangles

of small widths or

following

Simpson’s

quadrature

formula

_{(Elderton &}

_{Johnson, 1969)}

for the

_{approximation}

to the

_integration.

_Using

the result of

Equation

20 and the

_computed

_sample

variance of

_6,

that

_replaces

_Var(6)

in

_Equation

_21,

the

_r)6),6~j

attributed both to the test and to the

_specific

_{examinee group is obtained.}

_Similarly,

the SEM can be

obtained

_{by Equation}

25. For a CAT,

~(®a9 can

be obtained

_{by using}

the

_preset

_{criteria&horbar;7~(6)}

for the test

and

_~)(9)

for the retest-and the

_sample

variance of

_9j

that

_replaced

_{Var( 6j)}

in

_Equation

22.

If conditional unbiasedness is not

_supported

for the 0 interval of

_interest,

_appropriate

modifications for

Equations

21 and 25 are needed. This can be done

_using

the MLEBF

_(Lord,

_1983;

_Samejima,

1997, 1993a,

1993b)

and the modified ’TIFs

_(Samejima,

_1990).

The l~L~~~ and Two Modified TIF Formula

Lord

_{(1983) proposed a}

bias function for the MLE of 0 in the

_{three-parameter}

_logistic

_model;

its

oper-ating

characteristic for the correct response,

_~(0),

is

given by

where

_ag,

_bg,

and

_cg

are the item

_{discrimination,}

item

_difficulty,

and

_guessing

_parameters,

respectively,

and D is a

_scaling

_factor,

which is set

_equal

to 1.7 when the

_logistic

model is used as a substitute for the

normal

_ogive

model. Lord’s bias

function,

denoted

_by

_{~~®9 ®~,~9}

can be written as

In

_Equation

_27,

the bias should be

_negative

when

_&dquo;’i8)

is less than .5 for all the items

_[which

is

(6)

items

_[which

also

_{necessarily happens}

for some

interval,

_{(8H, +00)],}

and in between these values of 8 the

bias tends to be close to 0.0. This is obvious because the last factor on the

right-hand

side of

Equation

27 assumes

_negative

values for some items and

positive

values for

_others,

and their total tends to be close to

0.0,

provided

that the

_bgs

are distributed over a wide _rangeof 0. Lord

₍₁₉₈₄₎

_applied

this MLEBF to an 85-item verbal test and found that the bias was

_practically

0.0 for a wide range of 0.

Samejima

(1987, 1993a, 1993b)

expanded

the MLEBF to include _anydiscrete item responses. The

MLEBF in the

_general

case can be written as

where

_Ak(8)

is the basic function

_{(Samejima, 1969)}

_{for the discrete item response}

_k,,

and

_P&dquo;(0)

and

_#&dquo;(0)

denote

the

first and second

_partial

derivatives

_{of fl (0)}

with

_respect

to

0,

respectively.

For

the

graded

re-sponse model in which item score

_x~

assumes successive

_integers,

0

_through

_mg,

each

_kg

in

_Equation

29 must

be

_replaced

_by

the

_graded

item score

x~.

For a dichotomous response

model,

it can be reduced to the form

where

_g’(0)

and

_Pa’(®)

indicate the first and second

_partial

derivatives of

_~(8)

with

_respect

to

_0,

respec-tively. Equation

30 includes Lord’s bias function in the

_{three-parameter}

_logistic

model as a

_special

case.

Samejima

(1990)

proposed

two modified formulas for the TIF.

_They

both use the MLEBF. One modified version takes the

_reciprocal

of an

_approximate

minimum variance

_bound,

and the other modified version

takes an

_approximate

minimum bound of the mean

_squared

error of the maximum likelihood estimator. The first modified TIF,

Y’(®),

is defined

_by

The first

_partial

derivative of the MLEBF with

_respect

to

0,

which is used in

_Equation

31,

is

_{provided by}

for the

_general

case of discrete item _responses,where

_#[10)

and

_I’(0)

denote the third and the first

_partial

derivatives

_{of ~ (8)}

and

₁₍₀₎

with

_respect

to

_0,

_{respectlvcly.}

For a set of dichotomous

_items,

_Equation

32

becomes

’

where

_It’(S)

indicates the third

_partial

derivative of

_fl(0)

with _respectto 0

_{(see Samejima,}

1987, 1990,

1993a, 1993b).

Equation

31 shows that the

_relationship

between this new function and the

_original

TIF

_depends

on the

first

_partial

derivative of the MLEBF. To be more

precise,

if the

_partial

derivative is

_{positive, 1’(8)}

will be less

than

_1(0);

if it is

_negative,

this

_relationship

will be

reversed;

if it is 0.0

_(i.e.,

if the MLE is

_{conditionally}

(7)

The second modified TIF,

3(8),

is defined

_by

The difference between

_1’(8)

and

_{3(8) (Equations}

31 and

_34,

_{respectively),}

is the second and last term in

the braces of the

_right-hand

side of

_Equation

34. Because this term is

_{non-negative throughout}

the entire range of

0,

S(8) :5

1’(8) regardless

of the

_slope

of the MLEBF.

When the MLEBF of a test is

_{monotonically}

_increasing,

_Equations

31 and 34 show that

_T(0)

and

₃₍₈₎

will never be

_larger

than

_1(0).

In this

_specific

case,

S(8) :5

1’(8) :5

1(0) throughout

the entire range of 8.

~°(~i9 ®~)

and the SEM When Conditional Unbiasedness of the MLE of 0 Does Not Hold

When

_practical

conditional unbiasedness of the MLE of 0 does not exist-that

_is,

_B(o;

_8y)

is not

approxi-mately

0.0 for all values of 0 in the interval of

_{interest-T(0)}

or

_E(9)

should be substituted for

_{/(6) in}

Equations

21 and 25.

Thus,

Equation

21 can be rewritten as

or

In

_{theory, E(0) is}

more

appropriate,

but in _manycases

_{discrepancies}

between

_T(6)

and

_E(6)

are small

_(see

Samejima,

1990), so T(6)

can be a

good

substitute for

_{~,(®). ~ls®9}

_{in CAT,}

_T(6)

or

_E(6)

can be used as the

stopping

rule in

_place

of

_{I(0), ~nd}

_{Equation 22}

can be revised into the forums

or

In the same _way,the two modified sEt~~9

_and

_{~£,Z3 ~hich}

are attributed to a

_{specific distribution}

of

_8,

can be rewritten as

and

Equation

29 shows that

_~e;6y)

can be estimated if each item has been calibrated and the estimated

_{fl (0)}

has been obtained.

_Thus,

_using

_{this estimated MLEBF,}

_T(6)

and

_E(6)

are estimated

_using

_Equations

3 and

34,

respectively.

Using

these two modified TIFs, both

_{r(6,, 6~}

and the SEM can be obtained

_{using Equations}

35, 36, 39,

and

40,

in a similar manner as was described when

₁₍₀₎

was used.

p~~

Applications

TIFs and Predicted

r(6,, 6,)s

and SEMs

Method. Six distributions of 0 were

_{hypothesized.}

Predictions were made for each of the six 0 distri-butions about the

_r(6,,6,)s

and SEMs that would be obtained for a

_hypothetical

test

_using

the three different

(8)

TIF formulas. From

_Equations

21 and

_25,

the

r°(~,, ®Z)

and SEM were obtained

_using

_{1(8); fromAE9uations}

35 and

_{39, the}

~(®,9 62)

and SEM were obtained

_using

_T(8);

and from

_Equations

36 and

_{40, the}

r(®1, (2)

and SEM

were obtained

_{using ~(0). T’he}

six

_hypothetical

distributions of 0 were

_normally

distributed with different means and standard deviations: Distribution

_1,

_{N(O, 1);}

Distribution

_2,

_N(-.8,

_1);

Distribution

3,

~1(0, .5)9

Distribution

4,

N(-.8, .5);

Distribution

5,

h1(-i.6, .S)9

and Distribution

6,

N(-2.4, .5),

respectively. Figure

1 shows the

_density

functions of these six distributions of 8.

Figure

1

Density

Function of Six

_Hypothetical

Distributions of 0:

[From

Samejima (1994). Reprinted by

Permission of Kluwer Academic

Publishers]

The

_hypothetical

test consisted of 30

_equivalent

dichotomous

_items,

which followed the

_logistic

model

(Equation

26)

with

_{a -}

_{1.0, ~ -0.0, c~ = 0.0~}

and I~ =1.7. This

_particular

_hypothetical

test was selected

because the interval of 0 for which

_practical

conditional unbiasedness of the MLE

By,

_given

_9,

holds was

expected

to be small because of the common

difficulty

_parameter

for all the

items; therefore,

the

discrep-ancies between

_T(0)

or

_E(0)

and

1(0)

were

_expected

to be

_large

for _{a wider range of 0 than those for a}more

typical

test. This choice was made for the _purposeof

_comparing

the

_predictions

of

_r(b,,,,9,)

for a

_specific

distribution of 0 when

₁₍₀₎

was used versus

_T(8)

or

_E(0).

Another reason for this choice was considerations in CAT. In CAT,

except

for the initial few items

presented

to an

examinee,

the tailored subset of items selected from the item

_pool

consists of

nearly

equivalent

items. If the same level of accuracy in

estimating 6

for all examinees is

desired,

for

example,

then it is reasonable to use a

_{single specified}

amount of test information as the criterion in the

stopping

le. In so

_doing,

the choice between

₁₍₀₎

and

_T(6)

or

_will

make a substantial

_difference,

_especially

for _{examinees with very}

_high

levels of 0 _{and for examinees with very low levels of}

0,

because in _many

cases the item

_pool

will lack

_extremely

difficult and

_extremely

_easyitems.

Results. The MLEBF

of

this

_hypothetical

30-item test is shown in

_Figure

2. Note that outside the interval of 0

_(®I.09

_1.0)

the amount of bias becomes

_{increasingly large.}

The square roots of the TIFS

_[1(0),

T(6),

and

_{E(0)] are}

shown in

_Figure

3.

Tables I and 2

_present

the

_predicted

_r(6,,62)s

and SEMs for the six different distributions of

_0,

(9)

Figure 2

MLEBF of the

_{Hypothetical Test}

of 30

_Equivalent

Items

equal

width of .05 and

_{using a}

number of

rectangles.

In Table

1,

the mean and standard deviation

(SD)

of 0 for each of the six distributions also are

_given.

These SDs are

_slightly

different from the squares of the

second

_parameters

of the normal distributions-. 9915 7 versus 1.00000 for Distributions 1 and

_2,

and

.50155 versus .50000 for Distributions

3, 4, 5,

and

6,

_{respectively,}

whereas all of the means are the same as the

_hypothesized

normal distributions. The

_{discrepancies}

in sl~s are a result of

_using

the

_rectangle

method,

which uses the

_density

of each of the

_equally

_{spaced points}

of 0 as one side and the

_{step width,}

.05,

as the

other,

in order to

_approximate

the normal distributions.

Table 1_ shows that the

_predicted

r(6j, 62)

obtained

_{by using}

₇₍₈₎

was

_widely

_distributed;

that

_is,

it

_ranged

from .20049 to .89641.

fi8p 9~j

decreased as the mode of the distribution shifted from a range of 0 in which

the amount of test information was

_greater

_(e.g.,

Distributions 1 and

₃₎

to _{another range in which it}was

lesser

_(e.g.,

Distribution

_6).

The reduction was more

_conspicuous

when the SD of the normal distribution

was smaller

_(compare

Distributions 1 and 2 versus Distributions 3 and

4).

The

_predicted

_{~°~®~, ~2)s}

orb-tained

_{using T(8)}

indicated a substantial reduction from those

using l(B)

for each of the six distributions of 0. For Distribution

_2,

_r(6,,62)

decreased from .82324 to

_.264’~99

for Distribution 5 from .4~71 ~ to

.21681;

and for Distribution 6 from .20049 to .01182. The reduction is more

_conspicuous

for

Distribu-tions

2, 5,

and

6,

which were distributed on lower levels of 6 where the

_{discrepancies}

between

_I(B)

and

T(8)

were

_large.

_Among

the six

distributions,

the

_predicted

_r(6,,62)

obtained

using T(6)

varied from .01182 to

_.80074,

_showing

even

_larger

differences. Similar results were obtained for the

_predicted

?i6,, 6~js

s

using E(6);

these

r(6,,62)s

varied from .01109 to .79920.

_Also,

for each distribution the reduction in

~~~~9 ®2)

from that obtained

_{by Equation}

35 was

_relatively

small,

as

_expected

from

_Figure

3. Similar results were obtained for the sems, in the reverse

_order,

as shown in Table 2.

In CTT, the SEM, GE, is

given by

(10)

Figure

3

Square

Roots of

_{I(B) (Solid Line), ~’(0) (Dashed Line),}

and

_{See) (Dotted Line)}

for the

_Hypothetical

30-Item Test

[From Samejima (1994). Reprinted by

Permission of Kluwer Academic

Publishers]

obtainable

_{by Equation}

41

_using

the attributed

_i-(61,62)s

in Table 1 in

_place

of

f~

in

Equation

41 and the

corresponding

SEMs, which were obtained

_{by Equations}

_{25, 39,}

and 40. For

_{example, using}

the values of

r(6j,

02

obtained for Distribution 1 from the different TIF formulas

_{[1(0), T(0).,}

and

_these

results were

.31914, .46453,

and

_.47936,

_{respectively;}

for Distribution 3

_they

were

_{.21433, .22388,}

and

_.22475;

and for

Distribution 6

_they

were

.44846, .49857,

and .49876. There are differences in the

_degree

that this set of three

values differ from the

_{corresponding}

set in Table 2. These different

_degrees

of

disagreement

may be

ex-pected,

for the

_degree

of violation from the

_assumptions

behind CTT was different for each distribution of 0. The three

_predicted

error variances of the MLE of 6 arc

_presented

in Table

_2,

for each of the six

hypo-thetical distributions.

_They

were obtained

_{using Equation}

20,

and

_by

similar

_equations

in which

₁₍₀₎

was

replaced

by

T(0)

or

_{respectively.}

0 was divided into small intervals of .05

width,

and a number of

rectangles

were used for

_approximate

_integration.

_{Simpson’s quadrature}

formula

_{(Elderton &}

_Johnson,

1969)

also could have been used and

_perhaps

would have

_provided

more accurate results.

Predicted

_{Versus Empirical}

Reiiabilities

Nlethocl. In order to evaluate the

_resulting

_predicted

_{r~®1, 6~)s}

obtained

_{by using 1(0), T(9),}

and

_E(6),

a set of

_empirical

_{r(6,, ®2)s}

for the six 0 distributions were used as criteria based on simulated data.

Follow-Table 1

Obtained Mean and SI~s of 0 and Predicted

_{r 6,,}

_6,)s

for Each of the Six Distributions of 0 _{Using 1(0),}

_Y°~®),

and

_~,(®)

(11)

Table 2

Predicted SEMs and Theoretical Error SDs for Each of the Six 8 Distributions

Using I(e), ~(~~,

and

~(®~

ing

each of the six distributions of

0,

a _groupof examinees was

_{hypothesized. A}

_response

_pattern

of each

hypothetical

examinee was

_produced

for the test and retest situations

_using

the monte carlo method. Because the test consisted of 30

_equivalent

dichotomous test

_items,

the number-correct

_(NC)

test score was a sufficient statistic for the _response

pattern, and

the MLE of 0 was obtained as a one-to-one

_mapping

of this sufficient statistic

_{(Lord &}

_{Novick, 1968,}

_p.

_429).

There were

1,998

hypothetical

examinees for Distributions 1 and

2,

and

2,004 for

Distributions

_{3, 4, 5,}

and 6.

A

_problem

arose as to how to deal with the --s and +as obtained as the MLEs of 0, before the corre-lation coefficient between the two sets of

_b,s

could be

_computed.

Table 3 _presentsthe

_frequencies

of these two extreme values in the test and retest situations

_separately,

and of those in both

situations,

for each of the six distributions of 0.

_Although

there were

_only

three -ocs in the retest and no other --s or

+ms for Distribution

_3,

more than half of the examinees in Distribution 6 obtained -- as their MLE of 0

in the initial test as well as in the retest, and more than one-third of the total examinees obtained -a in

both. As is common

_practice,

the --s and -~--s were

_{replaced by arbitrary single}

values of -2.65 and

2.65,

respectively,

and the correlations were

_{computed using}

these values.

Table 3

Frequencies

of --s and +°°s Obtained as MLEs of 6 in the Initial Test

and the tZetesi, and in _Both,for Each of the Six Distributions of 0

~e~~rlts. Test-retest

_reliability

correlations,

together

with the two means, the two

variances,

and the

covariance,

are

_presented

in Table 4. These results were

_compared

to the

_predicted

_{r(o,, 6~)s}

in Table 1. For

Distribution

_3,

for which

_only

three

_replaced

values were used

_(see

Table

_4),

the

_empirical

_r(6,,6.,)s

were

very close to the

_predicted

_values;

that

_is,

.80724 versus .81738

_[1(0)],

.80074

_[T(9)],

and .799’

-Note, however,

that in

_general

the

_empirical

_{1-(6-1 61)}

became

_larger

than the

_predicted

value as the total

frequency

of the --s and +°°s increased. This enhancement is

_{artificial, however,}

because

_{by using}

-2.65

and 2.65 in

_place

of -- and +00,

respectively,

those who obtained -- as their MLE of 0 both in the test and

retest situations were treated as if

_they

obtained the same estimated 0 in the two

_testing

_situations,

even

though

the distance between the two &horbar;~s could be

_{infinitely large,}

for

_example.

The same

_logic

_applies

for

(12)

(see

Tables 1 and

_{4). For}

Distribution

4,

the

_frequencies

of +- were 0 in the test, the retest, and the combi-nation of

_both,

and the

_frequencies

of -00 were as small as 56 and 45 in the test and retest

_situations,

respectively,

with

_{only 14}

individuals

_overlapping

in both

_(i.e.,

the second smallest

_frequencies

next to

those for Distribution

_3).

Table 4

Empifical r(6~, 6~)s

for Each of the Six Distributions of 0 Based on the MLEs of the

_Sample

Examinees in the Test-Retest Situation

_{Using Arbitrary}

Values of ~-2.65 and ±2.12271 for Infinite Estimate Values and NC Scores,

and Mean and Variance of ±2.65 Scores at Test and I2etest and Their Covariances

The

_empirical

_{a~(~1, 6,)s}

also

_depended

on the

_replacement

values used for -- and _+-,

especially

when

many examinees received -c- or +0° as their MLE of 0

_(compare

Distributions 5 and 6 to Distributions 3 and

_4).

Because the two

_replacement

_values,

-2.65 and

2.65,

that were used in

_computing

the

_empirical

~°(9~, ®~)s

presented

in Table 4 were

_arbitrary,

the

_empirical

r(&,, 6.,)s

were

_computed

again by

_changing

these two

_replacement

values to --2.122‘~1 and

2.12271,

respectively.

The results also are shown in Table 4.

_Comparing

these values with those

_presented

in the 2.65 column of Table

4,

each of the values for the

empirical

_{r(b,, 6~)s}

is _greaterthan the

_{corresponding}

value for 2.65.

_Although

the increment is almost 0.0 for Distribution 3 and it is mild for Distribution

_4,

it is

_{substantially}

_large

for Distributions 5 and 6. These

results are

_predictable

from the differences in the number of

_replacement

values used for the different

distributions

_(compare

the

_frequencies

of --s and +-s for these distributions in Table

_4).

The

_variability

that exists _amongthe

_empirical

_{-r(®1, ®Z)s}

across different distributions of 0

_might

have been the result of

_using

_6~

rather than the 1~C score,

although

this

interpretation

is

illogical.

This

is not the case, as is obvious from

_theory.

To illustrate

_this,

the

_empirical

_r~salso were

_computed

using

the NC for each distribution of 0. The results in Table 4 show the same

_type

of

_variability

in the

empirical

_{r(6j, Q~s}

across the six 0 distributions as was observed when

9y

was used. The

_{empirical r,,’2S}

s based on NC were

_slightly

_higher

than those obtained

_using

^ov

and 2.65 or 2.12271 in

_place

®f-as and

+-s. This results from the fact that 0 and 30 were used for the two indeterminant scores and

artificially

enhanced the correlations. A set of

_r(6,,62)s

based on

e

could be

produced

that are even closer to those based on the NC score

_by

_adjusting

the

_replacement

values for --s and +~s.

When

_{a subgroup}

of examinees who obtained the same NC score other than 0 and n is

considered,

it can

be taken as an indicator of some sort of

_homogeneity

among these in the

subgroup.

When the test score is either of these two extreme

_{values, however,}

this

_type

of

_homogeneity

cannot be

_assumed,

because

_giving

0

or n as the test score

_simply

means that the test has failed in

_{discriminating}

these

examinees’

0 levels.

Be-cause

_{the r,,,,,}

based on _NC,or of the

_reliability

coefficient based on

6v

with an identical

_replacement

value

for each oo,

treats

each of these two

_subgroups

of examinees as if

_they

had the same level of

0,

the

_empirical

n6p

6,,)s

tend to be

_higher

than

_they

_actually

are,

especially

if there are _{many Os}or ns, or both. This

_explains

the differences between the

_empirical

_{r(6,, 6,)s}

in Table 4 and the

_predicted

_{~°(®,, 92~s}

in Table 1.

Estimating Upper

Bounds of Empirical

Reliabilities

(13)

not be used in

_obtaining

the

_empirical

r(6,, ê2)s

that are to be used as the criterion to evaluate the three

predicted

r( 81, (2)S.

Because there is no reasonable way to handle --s and +-s, several

_attempts

were focused on

_obtaining

an estimate of the upper bound of the

_empirical

~°~®I, 6~)

for each 8 distribution.

_Thus,

in Method I the

_resulting

_r(6,,

(2)S

were obtained

_by:

1.

_Assigning

a

_{randomly produced}

true 0

_following

the

_density

function

where

~a=30,

to each examinee whose MLE of 0 was -00. Then the amount of

_bias,

_{B(0; 6,),}

was added to

the 0 thus

_assigned,

and then the error _score,

_{produced by}

_{randomly following}

N

_{fO, was}

_added,

and the

_resulting

value was substituted for the --.

2. This same _processwas followed for each examinee whose MLE of 0 was _+~;

_however,

the

_density

function in

_{Equation 42}

was

_{replaced by}

An identical value ®f tr~e 6 was

_assigned

in the test and retest to an examinee who obtained -- or +00 in

both

_testing

situations. The estimate of the _upperbound of

_{r(6,, &J}

for each of the six distributions of 0 is

_presented

in Table 5

_{(Method 1 ).}

Because of the way these substitute values for -cos and +--s were

produced,

it was

_expected

that the results would be _very_{conservative upper bounds}

f~r r~®1, ®~~.

Table 5

Estimated

Upper

Bounds of

_r(6&dquo;6,)

_by

Methods 1 and 2 for

Each of the Six Distributions of 0 Based on the h4LEs of

Sample

Examinees in the Test-Retest Method

_{_}

In Method

2,

the results were obtained

by using

a truncated normal distribution to

_{generate a}

true

9;

that

_{is, the}

_population

normal

_density

function was divided

_by

vertical lines into five

_segnents

whose

areas were

_proportional

to the

_frequencies

of:

₍₁₎

those who obtained -- on both the test and _retest,

₍₂₎

those who obtained -- in the test

(or

retest)

situation

only,

(3)

those who obtained finite values for their

MLEs of 0 in the test

_{(or retest) situation, (4)}

those who obtained +°o in the test

_{(or retest)}

situation

_only,

and

₍₅₎

those who obtained +c- on both the test and retest. The ordinate in each segment was divided

by

the total area of the

_segment

to become densities.

Again,

an identical value of true 0 was

_assigned

in both the test and retest situations to an examinee who obtained -- or +- in _{both sessions. Because of the way} in which these substitute values for --s and +oos were

_produced,

it was

_expected

that the results would be even more conservative than those of Method 1.

The estimates of the _upperbound of

_{n6p 6,)}

for each of the six distributions of 0 for Method 2 also are

presented

in Table 5. The _{estimates of the upper bounds of the}

82)s

increased

substantially

over the

results obtained

_by

Method 1 when the data included a substantial number of --

(14)

substitute values would be used

_{(Distribution 6).}

This is because of how the true value of 9 was

_assigned

in Method

_2;

it made the

_variability

in the substitute values for -- _toosmall.

Although

the results of Method 1

_{provide legitimate}

_upperbounds of the

_empirical

_{r(bl, 62)}

for the

different 0

distributions,

it is obvious from

_theory

that

_they

are still far too conservative.

Thus,

procedures

similar to the above two methods were followed without

_assigning

an identical true 0 to an examinee who obtained -- or +00 in both the test and retest situations. The results also are shown in Table 5

(Method

1

Modified and Method 2

_Modified).

These two sets of results were similar to each other. Note that the two

values for each distribution of 0 were between the

_predicted

_r(6,,

6,)s

obtained

by

Equations

21

_[1(0)]

and 35

_[T(9)],

_{respectively (see}

Table

_1).

Discussion Practical

_pt~s

The above

_examples

were based on a

_hypothetical

test of 30

_equivalent

_items,

which

provided

large

discrepancies

between the

_original

TIF and the two modified TIF formulas.

However,

this test is not the

kind of test

_usually

used in

_{practice. Figure}

4 _presents_{the square}roots of the

_original

TIF and its two

modification formulas for two

_empirical

tests-the Iowa Level 11

_Vocabulary

Subtest

_(Figure

_4a),

which

consisted of 43

_{dichotomously}

scored

items,

and Shiba’s Word/Phrase

Comprehension

Test .11

_(Figure

4b),

which consists of 54

_{dichotomously}

scored items

_(see

_Samejima,

_1993a,

_1993b).

Figures

4a and 4b show that

₍₁₎

the three curves are flatter than those of the 30

_equivalent

test

_items,

(2)

the decrease in the amount of information is not as radical as 0

_departs

from the modal

_point,

and

₍₃₎

the

_{discrepancies}

between the _squareroot of the

_original

TIF and those of the modified formulas are not as

conspicuous

(Figure

4 versus

_{Figure 3).}

However,

the two curves for the modified formulas are almost

overlapping,

as was observed with the 30

_equivalent

test items

_{(Figure 3).}

Tables 1 and 5 showed

that,

although

all three

_predictions

of

_r(6&dquo;6,)

were accurate when the distribu-tion of 0 was in the interval of 0 in which the amount of test information was

_large,

when the 0

distribu-tion shifted _awayfrom this interval

_Equation

35 or

_Equation

_36,

in which

₁₍₀₎

in

_Equation

21 is

_replaced

by I(8)

or

_{E(0), respectively,}

were better

_predictors

than

_Equation

21. Thus the MLE bias function is useful in

_predicting

_{r(ol, 6,).}

_Considering

that the values in Table 5 are _upperbounds of the

reliability

coeffi--cients,

Equation

36 in which

_{E(0) is}

used will be the most

_appropriate

formula.

Thus the results of the

_present

research _suggestthat it is advisable to use

_E(9)

rather than

₁₍₀₎

for

predicting

the

_reliability

coefficient of a test attributed to a

_specific

distribution of

_8,

as well as the

crite-rion in the

_stopping

rule of a CAT. With a finite number of test items the MLE Of 0 is

_{conditionally biased,}

given

0.

_Therefore,

in

_theory

the use

_{of E(0),}

which is based on the minimum bound of the mean

_squared

error of

Qy

rather than the minimal variance bound

_{(Samejima, 1990),}

is more reasonable than that of

T(9)

also,

although

the two results

_provided

similar values here.

These

_examples

were selected

intentionally

to _{make the differences among the different}0 distribu-tions _{and among the three}

_predicted

_{~~®,9 ~Z~s}

for each 0 distribution

_{substantially large, using}

equiva-lent test items. Because

_equivalent

test items are seldom used in actual tests, the differences between

the

_resulting

_predicted

_reliability

coefficients obtained

_by

_{using 1(0)}

and

_{by using}

either

_T(9)

or

_E(6)

arc

_expected

to be less.

Because of more useful and informative measures like the TIF and its two modified

_formulas,

the

reliability

coefficient of a test is no

_{longer important}

in modern test

_theory.

It is

_interesting,

however,

to

predict

the

_reliability

coefficient-which is attributed to _separate_{groups of}

_{examinees-using}

the Tips. The traditional

_concept

of test

_reliability

is

misleading,

because the

reliability

coefficient of the same test can

be

_drastically

different for different _groupsof examinees.

If the

_reliability

coefficient must be

used,

a

_{practical suggestion}

is to

_compute

the

_predicted

(15)

Figure

4

Square

Roots

_{of lee) (Solid}

Line),

T(0) (Dashed Line),

and 2(e)

(Dotted Line)

for Two Tests

_Following

the

Logistic

Model

which the test is

_likely

to be administered. In this way, one of the

_reliability

coefficients can be

_selected,

depending

on the _groupfrom which the examinees

_being

tested have been

_sampled.

This will

_improve

the current use of the

_reliability

coefficient in which a

_single

value is recorded in the test manual as the

reliability

coefficient and is used with no consideration of the

_population

from which the

_sample

of examinees have been selected.

(16)

References

Birnbaum, A.

_(1968).

Some latent trait models and their

use in

_inferring

an examinee’s

_ability.

In F. M. Lord & M. R. Novick, Statistical theories

_{of mental}

test

scores

_{(pp. 397-479). Reading}

MA:

_{Addison-Wesley.}

Elderton,

W. P., &

_Johnson,

N. L.

_{(1969). Systems of}

fre-quency curves. London:

Cambridge University

Press.

Lawley,

D. N.

_(1943).

On

problems

connected with item

selection and test construction.

Proceedings of the

Royal Society of Edinburgh,

61A, 273-287.

Lord, F. M.

_(1952).

A

_theory

of test

_scores.

_Psychometric

Monograph,

No. 7.

Lord, F. M.

_(1983).

Unbiased estimators of

_ability

pa-rameters, of their _variance,and of their

_{parallel-fonns}

reliability. Psychometrika, 48,

233-245.

Lord, F. M.

_{(1984, April). Technical problems}

_arisingin

parameter estimation.

_{Paper presented}

at the annual

meeting

of the American Educational Research

Asso-ciation, New Orleans.

Lord, F. _M.,& Novick, M. R.

_(1968).

Statistical theories

of mental

test scores.

_Reading

MA:

_{Addison-Wesley.}

Samejima,

F.

_(1969).

Estimation of

_{ability using}

a

re-sponse pattern of

_graded

scores.

_{Psychometrika}

Mono-graph,

No. 17.

Samejima,

F.

_(1972).

A

_general

model for

_{free-response}

data.

_{Psychometrika Monograph,}

No. 18.

Samejima,

F.

(1977a).

Effects of individual

_optimization

in

_setting

boundaries of dichotomous items on

accu-racy of estimation.

Applied Psychological

Measure-ment, 1, 77-94.

Samejima,

F.

_(1977b).

A use of the information function in tailored

_{testing. Applied Psychological}

Measure-ment, 1, 233-247.

Samejima,

F.

_{(1979). Convergence of the}

conditional

dis-tribiatton

_of

the maximum likelihood estimate, given

latent trait, to the _asymptotic

_normality:

Observations made

_through

the constcant

_information

model

(Office

of Naval Research

_Rep.

No.

_79-3).

Knoxville:

Depart-ment of

_{Psychology, University}

of Tennessee.

Samejima,

F.

_{(1981). Efficient}

methods

_{of estimating}

the operating characteristics

_{of item}

_response_categories and

_{challenge to}

a new

_modelfor

the

_{multiple-choice}

item

_(Office

of Naval Research Final Report of

N00014-77-C-0360).

Knoxville :

_Department

of

Psy-chology, University

of Tennessee.

Samejima,

F.

_(1987).

_{Bias function ofthe}

maximum like-lihood estimate

_{of ability for}

discrete item _responses

(Office

of Naval Research

_Report

No.

_87-1).

Knox-ville:

_Department

of

_{Psychology, University}

of Ten-nessee.

Samejima,

F.

_{(1990). Modifications of the}

test

informa-tion

_{function (Office}

of Naval Research

_Report

No. 90-1). Knoxville: _Departmentof

_Psychology,

Univer-sity

of Tennessee.

Samejima,

F. (1993a).

An

_{approximation}

for the bias func-tion of the maximum likelihood estimate of a latent variable for the

general

case where the item responses

are discrete.

Psychonietrika,

58, 119-138.

Samejima,

F. (1993b).

The bias function of the maximum likelihood estimate of

_ability

for the dichotomous

re-sponse level.

Psychometrika,

58, 195-209.

Samejima,

F.

_(1994).

Roles of Fisher type information in latent trait models. In H.

_{Bozdogan (Ed.),}

Proceed-ings

of the first US/Japan conference

on

_{the frontiers}

of statistical modeling:

An

_{informational approach}

(pp. 347-378).

The Netherlands: Kluwer.

Acknowledgments

This research was

_{supported by}

the Office of Naval

Re-search Contracts N00014-87IK-0320 and N00014-90-J-1456.

Author’s Address

Send _requestsfor

_reprints

or further information to Fumiko

Samejima,

310B Austin

Peay Bldg., University

of Ten-nessee, Knoxville TN 37996-0900, U.S.A. Internet: