229
Information
Function
and
Its
Modifications
Fumiko
Samajima
University
of Tennessee
The
reliability
coefficient and the standard error ofmeasurement in classical test
theory
are not propertiesof
a specific
test, but are attributed to both aspecific
test and a
specific
trait distribution. In latent traitmod-els, or item response
theory,
the test informationfunc-tion
(TIF) provides
moreprecise
local measures of accuracy in trait estimation than are available from thereliability
coefficient. Thereliability
coefficient is stillwidely
used, however, and ispopular
because of itssimplicity.
Thus, it is worthwhile to relate it to the TIF.In this paper, the
reliability
coefficient ispredicted
from the TIF, or two modified TIF formulas, and a spe-cific trait distribution.
Examples
demonstrate thevari-ability of
thereliability
coefficient across different trait distributions, and the results arecompared
withempiri-cal
reliability
coefficients. Practicalsuggestions
aregiven
as to how to make better use of thereliability
coefficient. Index terms:
adaptive
testing, bias, clas-sical testtheory, item information function,
latent traitmodels, maximum likelihood estimation,
reliability
co-efficieno, standard
errorof measurement,
testinforma-tion
function, trait
estimation.Reliability
andvalidity
coefficients have beenwidely accepted by psychologists
and test users asim-portant concepts
in classical testtheory
(CTT).
Thereliability
coefficient(~°x~2), where X~
is the test scoreand X,
is the retest score, can beexpressed
as the ratio of the true score variance to the observed test scorevariance. ~ is
largely
influencedby
thehomogeneity
orheterogeneity
of the group of examinees towhom the test was administered as well as
properties
of the test. Inspite
ofthis,
many researchers and testusers still treat
r~~2
as if it were an attributesolely
of the test.In latent trait
models,
or item responsetheory
(iRT),
the item information function(IIF)
and the testinformation function
(TIF) provide
measures of the local accuracy of traitestimation,
aconcept
that ismiss-ing
in CTT. The values of the IIF and the TIF do notdepend
on thespecific
group of examineestested,
unliker,,,, (i.e.,
the IIF and TIF arepopulation-free).
Therefore,
~°X~2 and the associated standard error ofmeasure-ment
(SEM)
are not asimportant
asthey
were before IRTbecame
feasible.However,
because of itssimplicity
~-y
is stillpopular
among test users.Thus,
it is worthwhile to relate it to the TIF, and tosuggest
better waysto use
~X ~2.
L,a~l~y
(1943)
related the normalogive
model and ~XX. He showed that r~x% is obtained from the1 ~ 1 2
estimated error score variance and the observed test score variance when the trait distribution is
N(0,1)
and thedifficulty
parameters
of the n dichotomous test items distributenormally.
Lord(1952)
showed how the discrimination index of a test in CTT at agiven
trait level(which
isclosely
related to the amountof test
information)
is related to rxx, at aspecified
traitlevel,
when theregression
of test score on 0 levelis
approximately
linear.Samejima
(1977b)
showed that rlx, can be obtained from the TIF and the trait distribution of atarget
population.
Lord(1983)
discussed unbiased estimators of theparallel
formsrX~2
in thethree-parameter logistic
model. The idea ofreplacing
the CTTconcepts
of ~°~~ and the SEMby
the TIFin IRT is advanced
by
theproposal
of two modified TIF formulas(Samejima,
1~990),
which use the biasThe
present
paper uses information from the MLE bias function(MLEBF)
of a test topredict
therX¡X2
andSEM attributed to a
specified
examinee group to whom a test is to be administered. The results ofpredicting
TX¡X2
using
theoriginal
TIF and the two modified versions of the TIF also arecompared.
The TIF and Local Standard Error of Estimation
Let 0 be the latent trait that takes on any real number. Assume that there is a set of n test items
measuring
0 whose characteristics are known.
Let g
denote such anitem, kg
be a discrete response to item g,and fl (0)
denote the
operating
characteristic ofk~9
or the conditionalprobability
assigned
toAg,
given
8;
thatis,
k,Assume that
~(8)
is at least five-times differentiable with respect to 0. The item response informationfunction
(S~rnejirna9 1972)
is defined asand the IIF,
~(9),
is defined as the conditionalexpectation
of ~ (0),
given 8,
such thatWhen item g is scored
dichotomously,
the IIF issimplified
towhere
P (8)
is theoperating
characteristic of the correct answer to item g. This is identical to the item information functionproposed by
~irnba~~m(1968)
for the dichotomous response item. Let V be are-sponse
pattern such
thatThe
operating
characteristic,
P,(O),
of the response pattern V is defined as the conditionalprobability
ofV,
given
0,
andassuming
localindependence (Lord &
Nwi~k9
1968),
The response
pattern
information function(Samejima, 1972),
7y(e), is
given
by
The TIF,
1(0),
is defined as the conditionalexpectation
of 1,,(0),
given
0,
and fromEquations
2, 3, 6,
and 7,Again,
therelationship
between the IIF and the TIF demonstrated inEquation 8
is identical to the result Birnbaum(1968)
demonstrated at the dichotomous response level.The
reciprocal
of the square root of the TIP,[7(9)]&dquo;~,
is theasymptotic standard
deviation of the condi-tional distribution of the MLE of0,
given
its true value. This functionusually
is used as the standard errorof estimation
(SEE) even
when the number of test items is finite andrelatively small.
Note that the SEE is a function of 0-it islocally
defined.Also,
unlike itscounterpart
in CTT, the SEE does notdepend
on aspecific
group ofexaminees,
but issolely
aproperty
of the test.In CTT, the SEE
gives
theimpression
that the testalways
provides
the extent of error indicatedby
its value.Common sense
suggests, however,
that no test can beappropriate
for every group of examinees. Forex-ample,
if asimple
arithmetic addition test is administered tocollege
mathematicsmajors,
almost everyone willcorrectly
answer all the items. In such a case, the test isuseless,
because the results do not reflect the individual differences among the examinees. If a calculus test is administered toelementary
schoolstudents,
opposite-but
similar-results will occur. Thus each test is effectiveonly locally
on the 0 dimension. The SEMof the test should
differ,
therefore,
for groups of examinees at different locations on the 0 scale. It is moreappropriate
to consider the error of measurement as a function of0,
as IRT models do.Prediction of the
e9I~~~II~y
Coefficient the SEM for aSpecifle
0 DistributionUsing
the TIFUsing
the TIF, it ispossible
to link CTT with IRTthrough
theprediction
of r°x X, and the SEM fora specified
8 distribution or a
specified
group of examinees(Samejima, 1977b).
’ ’
Let
at
be any estimator of0,
were £ denotes the error variable. In the test-retest
situation,
where a
subscript
1 indicates the test, and asubscript
2 indicates the retest. If it can be assumed that in thetest-retest
situation,
and
then
where
0*
represents eitheratl
orQ~,
and e represents either £1 or E,.Thus,
the correlation between8tl
and
8tz
isNote that if 8 is
replaced
by
one of its transformed forms(i.e.,
the true test score~’),
and if the observedtest score X is used as the estimator of
T’,
and E is used as its error ofestimation,
thenEquation
9 can be rewritten asthen
Equation
15 becomes a familiarequation
for thereliability
coefficient fy^,9
In
general,
reduced to
When the MLE of 0 is
used,
let8y
or 6
denote the MLE of 0 based on the responsepattern
V.Samejima
(1977a,
1979)
observed that even with arelatively
small number of test items the conditional distribution of6,
given
0,
can beapproximated by
the normal distributionN{8,[7(6)]~}
if two conditions hold. Condition 1is
that 11
must bepractically
conditionally
unbiased for the 8 interval of interest. Condition 2 is that1(0)
must assume
reasonably high
values for thatspecific
interval. The bestapproximation
occurs when thisinterval covers the range of 8 within which most examinees are located. When this is the case, from
Equation
19,
where y(e)
denotes thedensity
function of 6 fora specific
group of examinees.Thus,
fromEquation
15,
Equation
21 indicates that thereliability
coefficient)r(6j, #~))
can bepredicted
by a
single
administration of the test,given 1(0)
and the 0 distribution of the examinees.A
.
It also has been observed
(Samejima,
1977b)
that incomputerized adaptive
testing
(CAT),
~°(®h ®2~
canbe
predicted
if aspecified
amount of test information is used as thestopping
rule fora given
0 level in thetest and retest situations.
Thus,
where
7~(6)
and~(0)
are the preset criterion Tips in the test and retestsituations,
respectively,
which areadopted
as thestopping
rules for the twotesting
sessions(a subscript
1 indicates the testsituation,
and asubscript
2 indicates the retestsituation).
Note that these two criterion Tips need not be the same for thetest and the retest, nor do
they
need to be constant for all 0. Note also thatr(6p 6j
is obtainable from asingle
testadministration,
because all that is needed is thesample
varianceof 0,
toreplace
Var(ej
inEquation
22.t-(61,62)
willchange
if thepreset
criterion TiFs[/~(8),
I~z~(~)9
orboth]
change.
For thesimpli-fied case in which the same amount of test information is used as the criterion for
terminating
thepresen-tation of new items for every examinee in each of the test and retest
situations,
respectively, Equation
22can be rewritten as
where
U2
and~2
are thereciprocals
of the constant amounts of criterion test information in the twotesting
situations,
respectively.
If a constant amount of test information is used as thestopping
rule for everyexaminee in both the test and retest
situations,
thenr(6,,
6,)
takes thesimplest
form whereCY2
denotes thereciprocal
of this common constant amount of test information.The
appropriateness
of the above normalapproximation
of the conditional distribution of6,
given
0,
can be examined
using
the monte carlo method(see Samejima, 1977a).
A necessary condition for thisapproximation
is that6
ispractically
conditionally unbiased,
thatis,
theregression
of6
on 0 is very closeintro-duced below. Note that the MLEBF
together
with the 0 distribution of thetarget
population
also determines whether theassumption
describedby Equation
13 can beaccepted.
Because the SEM in CTT is for the entire group of
examinees,
the SEM, Og, of a test attributed to aspecific
0 distribution can be written as
when Conditions 1 and 2 described above hold.
Assume that the test items have
already
beencalibrated;
thatis, the
itemparameters
of each item have been estimatedfollowing
anappropriate
mathematical model for~(8).
This can be done eitherseparately
.from, or
simultaneously
with,
maximum likelihood estimation of the individualparameter
0 for anonadaptive
test. In the formersituation,
8y
will be obtained for every individualby using
the estimated itemparameters.
In CAT, the itemparameters
for the items in the itempool usually
have been estimated beforeusing
the itempool
foradaptive testing; using
these estimated itemparameters
the individualparameter
8y
is obtained as a result oftesting,
which is based on a tailored subset of the itempool
for eachexaminee.
Thus,
Var(6) and Var(9J in
Equations
21 or 22 can becomputed.
From
Equations
2, 3,
and8,
1(0)
for anonadaptive
test can be estimatedusing
the estimated itemparameters. The
density function,
f(O),
can either be estimated(e.g., Samejima, 1981)
or set apriori if
prediction
is the purpose, as it is here.Thus,
E{[/(6)]’’}
is obtained with1(0)
replaced by
the estimated TIPin
Equation
20,
approximating
the areaby
a number ofrectangles
of small widths orfollowing
Simpson’s
quadrature
formula(Elderton &
Johnson, 1969)
for theapproximation
to theintegration.
Using
the result ofEquation
20 and thecomputed
sample
variance of6,
thatreplaces
Var(6)
inEquation
21,
ther)6),6~j
attributed both to the test and to thespecific
examinee group is obtained.Similarly,
the SEM can beobtained
by Equation
25. For a CAT,~(®a9 can
be obtainedby using
thepreset
criteria―7~(6)
for the testand
~)(9)
for the retest-and thesample
variance of9j
thatreplaced
Var( 6j)
inEquation
22.If conditional unbiasedness is not
supported
for the 0 interval ofinterest,
appropriate
modifications forEquations
21 and 25 are needed. This can be doneusing
the MLEBF(Lord,
1983;
Samejima,
1997, 1993a,
1993b)
and the modified ’TIFs(Samejima,
1990).
The l~L~~~ and Two Modified TIF Formula
Lord
(1983) proposed a
bias function for the MLE of 0 in thethree-parameter
logistic
model;
itsoper-ating
characteristic for the correct response,~(0),
isgiven by
where
ag,
bg,
andcg
are the itemdiscrimination,
itemdifficulty,
andguessing
parameters,
respectively,
and D is a
scaling
factor,
which is setequal
to 1.7 when thelogistic
model is used as a substitute for thenormal
ogive
model. Lord’s biasfunction,
denotedby
~~®9 ®~,~9
can be written asIn
Equation
27,
the bias should benegative
when&dquo;’i8)
is less than .5 for all the items[which
isitems
[which
alsonecessarily happens
for someinterval,
(8H, +00)],
and in between these values of 8 thebias tends to be close to 0.0. This is obvious because the last factor on the
right-hand
side ofEquation
27 assumesnegative
values for some items andpositive
values forothers,
and their total tends to be close to0.0,
provided
that thebgs
are distributed over a wide range of 0. Lord(1984)
applied
this MLEBF to an 85-item verbal test and found that the bias waspractically
0.0 for a wide range of 0.Samejima
(1987, 1993a, 1993b)
expanded
the MLEBF to include any discrete item responses. TheMLEBF in the
general
case can be written aswhere
Ak(8)
is the basic function(Samejima, 1969)
for the discrete item responsek,,
andP&dquo;(0)
and#&dquo;(0)
denote
the
first and secondpartial
derivativesof fl (0)
withrespect
to0,
respectively.
Forthe
graded
re-sponse model in which item score
x~
assumes successiveintegers,
0through
mg,
eachkg
inEquation
29 mustbe
replaced
by
thegraded
item scorex~.
For a dichotomous responsemodel,
it can be reduced to the formwhere
g’(0)
andPa’(®)
indicate the first and secondpartial
derivatives of~(8)
withrespect
to0,
respec-tively. Equation
30 includes Lord’s bias function in thethree-parameter
logistic
model as aspecial
case.Samejima
(1990)
proposed
two modified formulas for the TIF.They
both use the MLEBF. One modified version takes thereciprocal
of anapproximate
minimum variancebound,
and the other modified versiontakes an
approximate
minimum bound of the meansquared
error of the maximum likelihood estimator. The first modified TIF,Y’(®),
is definedby
The first
partial
derivative of the MLEBF withrespect
to0,
which is used inEquation
31,
isprovided by
for the
general
case of discrete item responses, where#[10)
andI’(0)
denote the third and the firstpartial
derivatives
of ~ (8)
and1(0)
withrespect
to0,
respectlvcly.
For a set of dichotomousitems,
Equation
32becomes
’
where
It’(S)
indicates the thirdpartial
derivative offl(0)
with respect to 0(see Samejima,
1987, 1990,
1993a, 1993b).
Equation
31 shows that therelationship
between this new function and theoriginal
TIFdepends
on thefirst
partial
derivative of the MLEBF. To be moreprecise,
if thepartial
derivative ispositive, 1’(8)
will be lessthan
1(0);
if it isnegative,
thisrelationship
will bereversed;
if it is 0.0(i.e.,
if the MLE isconditionally
The second modified TIF,
3(8),
is definedby
The difference between
1’(8)
and3(8) (Equations
31 and34,
respectively),
is the second and last term inthe braces of the
right-hand
side ofEquation
34. Because this term isnon-negative throughout
the entire range of0,
S(8) :5
1’(8) regardless
of theslope
of the MLEBF.When the MLEBF of a test is
monotonically
increasing,
Equations
31 and 34 show thatT(0)
and3(8)
will never belarger
than1(0).
In thisspecific
case,S(8) :5
1’(8) :5
1(0) throughout
the entire range of 8.~°(~i9 ®~)
and the SEM When Conditional Unbiasedness of the MLE of 0 Does Not HoldWhen
practical
conditional unbiasedness of the MLE of 0 does not exist-thatis,
B(o;
8y)
is notapproxi-mately
0.0 for all values of 0 in the interval ofinterest-T(0)
orE(9)
should be substituted for/(6) in
Equations
21 and 25.Thus,
Equation
21 can be rewritten asor
In
theory, E(0) is
moreappropriate,
but in many casesdiscrepancies
betweenT(6)
andE(6)
are small(see
Samejima,
1990), so T(6)
can be agood
substitute for~,(®). ~ls®9
in CAT,T(6)
orE(6)
can be used as thestopping
rule inplace
ofI(0), ~nd
Equation 22
can be revised into the forumsor
In the same way, the two modified sEt~~9
and
~£,Z3 ~hich
are attributed to aspecific distribution
of8,
can be rewritten asand
Equation
29 shows that~e;6y)
can be estimated if each item has been calibrated and the estimatedfl (0)
has been obtained.Thus,
using
this estimated MLEBF,T(6)
andE(6)
are estimatedusing
Equations
3 and34,
respectively.
Using
these two modified TIFs, bothr(6,, 6~
and the SEM can be obtainedusing Equations
35, 36, 39,
and40,
in a similar manner as was described when1(0)
was used.p~~
Applications
TIFs and Predicted
r(6,, 6,)s
and SEMsMethod. Six distributions of 0 were
hypothesized.
Predictions were made for each of the six 0 distri-butions about ther(6,,6,)s
and SEMs that would be obtained for ahypothetical
testusing
the three differentTIF formulas. From
Equations
21 and25,
ther°(~,, ®Z)
and SEM were obtainedusing
1(8); fromAE9uations
35 and39, the
~(®,9 62)
and SEM were obtainedusing
T(8);
and fromEquations
36 and40, the
r(®1, (2)
and SEMwere obtained
using ~(0). T’he
sixhypothetical
distributions of 0 werenormally
distributed with different means and standard deviations: Distribution1,
N(O, 1);
Distribution2,
N(-.8,
1);
Distribution3,
~1(0, .5)9
Distribution
4,
N(-.8, .5);
Distribution5,
h1(-i.6, .S)9
and Distribution6,
N(-2.4, .5),
respectively. Figure
1 shows the
density
functions of these six distributions of 8.Figure
1Density
Function of SixHypothetical
Distributions of 0:[From
Samejima (1994). Reprinted by
Permission of Kluwer AcademicPublishers]
The
hypothetical
test consisted of 30equivalent
dichotomousitems,
which followed thelogistic
model(Equation
26)
witha -
1.0, ~ -0.0, c~ = 0.0~
and I~ =1.7. Thisparticular
hypothetical
test was selectedbecause the interval of 0 for which
practical
conditional unbiasedness of the MLEBy,
given
9,
holds wasexpected
to be small because of the commondifficulty
parameter
for all theitems; therefore,
thediscrep-ancies between
T(0)
orE(0)
and1(0)
wereexpected
to belarge
for a wider range of 0 than those for a moretypical
test. This choice was made for the purpose ofcomparing
thepredictions
ofr(b,,,,9,)
for aspecific
distribution of 0 when1(0)
was used versusT(8)
orE(0).
Another reason for this choice was considerations in CAT. In CAT,
except
for the initial few itemspresented
to anexaminee,
the tailored subset of items selected from the itempool
consists ofnearly
equivalent
items. If the same level of accuracy inestimating 6
for all examinees isdesired,
forexample,
then it is reasonable to use a
single specified
amount of test information as the criterion in thestopping
le. In so
doing,
the choice between1(0)
andT(6)
orwill
make a substantialdifference,
especially
for examinees with very
high
levels of 0 and for examinees with very low levels of0,
because in manycases the item
pool
will lackextremely
difficult andextremely
easy items.Results. The MLEBF
of
thishypothetical
30-item test is shown inFigure
2. Note that outside the interval of 0(®I.09
1.0)
the amount of bias becomesincreasingly large.
The square roots of the TIFS[1(0),
T(6),
andE(0)] are
shown inFigure
3.Tables I and 2
present
thepredicted
r(6,,62)s
and SEMs for the six different distributions of0,
Figure 2
MLEBF of the
Hypothetical Test
of 30Equivalent
Itemsequal
width of .05 andusing a
number ofrectangles.
In Table1,
the mean and standard deviation(SD)
of 0 for each of the six distributions also aregiven.
These SDs areslightly
different from the squares of thesecond
parameters
of the normal distributions-. 9915 7 versus 1.00000 for Distributions 1 and2,
and.50155 versus .50000 for Distributions
3, 4, 5,
and6,
respectively,
whereas all of the means are the same as thehypothesized
normal distributions. Thediscrepancies
in sl~s are a result ofusing
therectangle
method,
which uses thedensity
of each of theequally
spaced points
of 0 as one side and thestep width,
.05,
as theother,
in order toapproximate
the normal distributions.Table 1_ shows that the
predicted
r(6j, 62)
obtainedby using
7(8)
waswidely
distributed;
thatis,
itranged
from .20049 to .89641.fi8p 9~j
decreased as the mode of the distribution shifted from a range of 0 in whichthe amount of test information was
greater
(e.g.,
Distributions 1 and3)
to another range in which it waslesser
(e.g.,
Distribution6).
The reduction was moreconspicuous
when the SD of the normal distributionwas smaller
(compare
Distributions 1 and 2 versus Distributions 3 and4).
Thepredicted
~°~®~, ~2)s
orb-tainedusing T(8)
indicated a substantial reduction from thoseusing l(B)
for each of the six distributions of 0. For Distribution2,
r(6,,62)
decreased from .82324 to.264’~99
for Distribution 5 from .4~71 ~ to.21681;
and for Distribution 6 from .20049 to .01182. The reduction is moreconspicuous
forDistribu-tions
2, 5,
and6,
which were distributed on lower levels of 6 where thediscrepancies
betweenI(B)
andT(8)
werelarge.
Among
the sixdistributions,
thepredicted
r(6,,62)
obtainedusing T(6)
varied from .01182 to.80074,
showing
evenlarger
differences. Similar results were obtained for thepredicted
?i6,, 6~js
susing E(6);
theser(6,,62)s
varied from .01109 to .79920.Also,
for each distribution the reduction in~~~~9 ®2)
from that obtainedby Equation
35 wasrelatively
small,
asexpected
fromFigure
3. Similar results were obtained for the sems, in the reverseorder,
as shown in Table 2.In CTT, the SEM, GE, is
given by
Figure
3Square
Roots ofI(B) (Solid Line), ~’(0) (Dashed Line),
andSee) (Dotted Line)
for theHypothetical
30-Item Test[From Samejima (1994). Reprinted by
Permission of Kluwer AcademicPublishers]
obtainable
by Equation
41using
the attributedi-(61,62)s
in Table 1 inplace
off~
inEquation
41 and thecorresponding
SEMs, which were obtainedby Equations
25, 39,
and 40. Forexample, using
the values ofr(6j,
02
obtained for Distribution 1 from the different TIF formulas[1(0), T(0).,
andthese
results were.31914, .46453,
and.47936,
respectively;
for Distribution 3they
were.21433, .22388,
and.22475;
and forDistribution 6
they
were.44846, .49857,
and .49876. There are differences in thedegree
that this set of threevalues differ from the
corresponding
set in Table 2. These differentdegrees
ofdisagreement
may beex-pected,
for thedegree
of violation from theassumptions
behind CTT was different for each distribution of 0. The threepredicted
error variances of the MLE of 6 arcpresented
in Table2,
for each of the sixhypo-thetical distributions.
They
were obtainedusing Equation
20,
andby
similarequations
in which1(0)
wasreplaced
by
T(0)
orrespectively.
0 was divided into small intervals of .05width,
and a number ofrectangles
were used forapproximate
integration.
Simpson’s quadrature
formula(Elderton &
Johnson,
1969)
also could have been used andperhaps
would haveprovided
more accurate results.Predicted
Versus Empirical
ReiiabilitiesNlethocl. In order to evaluate the
resulting
predicted
r~®1, 6~)s
obtainedby using 1(0), T(9),
andE(6),
a set ofempirical
r(6,, ®2)s
for the six 0 distributions were used as criteria based on simulated data.Follow-Table 1
Obtained Mean and SI~s of 0 and Predicted
r 6,,
6,)s
for Each of the Six Distributions of 0 Using 1(0),Y°~®),
and~,(®)
Table 2
Predicted SEMs and Theoretical Error SDs for Each of the Six 8 Distributions
Using I(e), ~(~~,
and~(®~
ing
each of the six distributions of0,
a group of examinees washypothesized. A
responsepattern
of eachhypothetical
examinee wasproduced
for the test and retest situationsusing
the monte carlo method. Because the test consisted of 30equivalent
dichotomous testitems,
the number-correct(NC)
test score was a sufficient statistic for the responsepattern, and
the MLE of 0 was obtained as a one-to-onemapping
of this sufficient statistic
(Lord &
Novick, 1968,
p.429).
There were1,998
hypothetical
examinees for Distributions 1 and2,
and2,004 for
Distributions3, 4, 5,
and 6.A
problem
arose as to how to deal with the --s and +as obtained as the MLEs of 0, before the corre-lation coefficient between the two sets ofb,s
could becomputed.
Table 3 presents thefrequencies
of these two extreme values in the test and retest situationsseparately,
and of those in bothsituations,
for each of the six distributions of 0.Although
there wereonly
three -ocs in the retest and no other --s or+ms for Distribution
3,
more than half of the examinees in Distribution 6 obtained -- as their MLE of 0in the initial test as well as in the retest, and more than one-third of the total examinees obtained -a in
both. As is common
practice,
the --s and -~--s werereplaced by arbitrary single
values of -2.65 and2.65,
respectively,
and the correlations werecomputed using
these values.Table 3
Frequencies
of --s and +°°s Obtained as MLEs of 6 in the Initial Testand the tZetesi, and in Both, for Each of the Six Distributions of 0
~e~~rlts. Test-retest
reliability
correlations,
together
with the two means, the twovariances,
and thecovariance,
arepresented
in Table 4. These results werecompared
to thepredicted
r(o,, 6~)s
in Table 1. ForDistribution
3,
for whichonly
threereplaced
values were used(see
Table4),
theempirical
r(6,,6.,)s
werevery close to the
predicted
values;
thatis,
.80724 versus .81738[1(0)],
.80074[T(9)],
and .799’-Note, however,
that ingeneral
theempirical
1-(6-1 61)
becamelarger
than thepredicted
value as the totalfrequency
of the --s and +°°s increased. This enhancement isartificial, however,
becauseby using
-2.65and 2.65 in
place
of -- and +00,respectively,
those who obtained -- as their MLE of 0 both in the test andretest situations were treated as if
they
obtained the same estimated 0 in the twotesting
situations,
eventhough
the distance between the two ―~s could beinfinitely large,
forexample.
The samelogic
applies
for(see
Tables 1 and4). For
Distribution4,
thefrequencies
of +- were 0 in the test, the retest, and the combi-nation ofboth,
and thefrequencies
of -00 were as small as 56 and 45 in the test and retestsituations,
respectively,
withonly 14
individualsoverlapping
in both(i.e.,
the second smallestfrequencies
next tothose for Distribution
3).
Table 4
Empifical r(6~, 6~)s
for Each of the Six Distributions of 0 Based on the MLEs of theSample
Examinees in the Test-Retest SituationUsing Arbitrary
Values of ~-2.65 and ±2.12271 for Infinite Estimate Values and NC Scores,and Mean and Variance of ±2.65 Scores at Test and I2etest and Their Covariances
The
empirical
a~(~1, 6,)s
alsodepended
on thereplacement
values used for -- and +-,especially
whenmany examinees received -c- or +0° as their MLE of 0
(compare
Distributions 5 and 6 to Distributions 3 and4).
Because the tworeplacement
values,
-2.65 and2.65,
that were used incomputing
theempirical
~°(9~, ®~)s
presented
in Table 4 werearbitrary,
theempirical
r(&,, 6.,)s
werecomputed
again by
changing
these two
replacement
values to --2.122‘~1 and2.12271,
respectively.
The results also are shown in Table 4.Comparing
these values with thosepresented
in the 2.65 column of Table4,
each of the values for theempirical
r(b,, 6~)s
is greater than thecorresponding
value for 2.65.Although
the increment is almost 0.0 for Distribution 3 and it is mild for Distribution4,
it issubstantially
large
for Distributions 5 and 6. Theseresults are
predictable
from the differences in the number ofreplacement
values used for the differentdistributions
(compare
thefrequencies
of --s and +-s for these distributions in Table4).
The
variability
that exists among theempirical
-r(®1, ®Z)s
across different distributions of 0might
have been the result of
using
6~
rather than the 1~C score,although
thisinterpretation
isillogical.
Thisis not the case, as is obvious from
theory.
To illustratethis,
theempirical
r~s also werecomputed
using
the NC for each distribution of 0. The results in Table 4 show the sametype
ofvariability
in theempirical
r(6j, Q~s
across the six 0 distributions as was observed when9y
was used. Theempirical r,,’2S
s based on NC wereslightly
higher
than those obtainedusing
^ov
and 2.65 or 2.12271 inplace
®f-as and+-s. This results from the fact that 0 and 30 were used for the two indeterminant scores and
artificially
enhanced the correlations. A set of
r(6,,62)s
based one
could beproduced
that are even closer to those based on the NC scoreby
adjusting
thereplacement
values for --s and +~s.When
a subgroup
of examinees who obtained the same NC score other than 0 and n isconsidered,
it canbe taken as an indicator of some sort of
homogeneity
among these in thesubgroup.
When the test score is either of these two extremevalues, however,
thistype
ofhomogeneity
cannot beassumed,
becausegiving
0or n as the test score
simply
means that the test has failed indiscriminating
theseexaminees’
0 levels.Be-cause
the r,,,,,
based on NC, or of thereliability
coefficient based on6v
with an identicalreplacement
valuefor each oo,
treats
each of these twosubgroups
of examinees as ifthey
had the same level of0,
theempirical
n6p
6,,)s
tend to behigher
thanthey
actually
are,especially
if there are many Os or ns, or both. Thisexplains
the differences between the
empirical
r(6,, 6,)s
in Table 4 and thepredicted
~°(®,, 92~s
in Table 1.Estimating Upper
Bounds of Empirical
Reliabilitiesnot be used in
obtaining
theempirical
r(6,, ê2)s
that are to be used as the criterion to evaluate the threepredicted
r( 81, (2)S.
Because there is no reasonable way to handle --s and +-s, severalattempts
were focused onobtaining
an estimate of the upper bound of theempirical
~°~®I, 6~)
for each 8 distribution.Thus,
in Method I the
resulting
r(6,,
(2)S
were obtainedby:
1.
Assigning
arandomly produced
true 0following
thedensity
functionwhere
~a=30,
to each examinee whose MLE of 0 was -00. Then the amount ofbias,
B(0; 6,),
was added tothe 0 thus
assigned,
and then the error score,produced by
randomly following
NfO, was
added,
and the
resulting
value was substituted for the --.2. This same process was followed for each examinee whose MLE of 0 was +~;
however,
thedensity
function in
Equation 42
wasreplaced by
An identical value ®f tr~e 6 was
assigned
in the test and retest to an examinee who obtained -- or +00 inboth
testing
situations. The estimate of the upper bound ofr(6,, &J
for each of the six distributions of 0 ispresented
in Table 5(Method 1 ).
Because of the way these substitute values for -cos and +--s wereproduced,
it wasexpected
that the results would be very conservative upper boundsf~r r~®1, ®~~.
Table 5
Estimated
Upper
Bounds ofr(6&dquo;6,)
by
Methods 1 and 2 forEach of the Six Distributions of 0 Based on the h4LEs of
Sample
Examinees in the Test-Retest Method_
In Method
2,
the results were obtainedby using
a truncated normal distribution togenerate a
true9;
that
is, the
population
normaldensity
function was dividedby
vertical lines into fivesegnents
whoseareas were
proportional
to thefrequencies
of:(1)
those who obtained -- on both the test and retest,(2)
those who obtained -- in the test
(or
retest)
situationonly,
(3)
those who obtained finite values for theirMLEs of 0 in the test
(or retest) situation, (4)
those who obtained +°o in the test(or retest)
situationonly,
and
(5)
those who obtained +c- on both the test and retest. The ordinate in each segment was dividedby
the total area of the
segment
to become densities.Again,
an identical value of true 0 wasassigned
in both the test and retest situations to an examinee who obtained -- or +- in both sessions. Because of the way in which these substitute values for --s and +oos wereproduced,
it wasexpected
that the results would be even more conservative than those of Method 1.The estimates of the upper bound of
n6p 6,)
for each of the six distributions of 0 for Method 2 also arepresented
in Table 5. The estimates of the upper bounds of the82)s
increasedsubstantially
over theresults obtained
by
Method 1 when the data included a substantial number of --substitute values would be used
(Distribution 6).
This is because of how the true value of 9 wasassigned
in Method
2;
it made thevariability
in the substitute values for -- too small.Although
the results of Method 1provide legitimate
upper bounds of theempirical
r(bl, 62)
for thedifferent 0
distributions,
it is obvious fromtheory
thatthey
are still far too conservative.Thus,
procedures
similar to the above two methods were followed without
assigning
an identical true 0 to an examinee who obtained -- or +00 in both the test and retest situations. The results also are shown in Table 5(Method
1Modified and Method 2
Modified).
These two sets of results were similar to each other. Note that the twovalues for each distribution of 0 were between the
predicted
r(6,,
6,)s
obtainedby
Equations
21[1(0)]
and 35[T(9)],
respectively (see
Table1).
Discussion Practical
~~p~~~~t~~~s
The above
examples
were based on ahypothetical
test of 30equivalent
items,
whichprovided
large
discrepancies
between theoriginal
TIF and the two modified TIF formulas.However,
this test is not thekind of test
usually
used inpractice. Figure
4 presents the square roots of theoriginal
TIF and its twomodification formulas for two
empirical
tests-the Iowa Level 11Vocabulary
Subtest(Figure
4a),
whichconsisted of 43
dichotomously
scoreditems,
and Shiba’s Word/PhraseComprehension
Test .11(Figure
4b),
which consists of 54dichotomously
scored items(see
Samejima,
1993a,
1993b).
Figures
4a and 4b show that(1)
the three curves are flatter than those of the 30equivalent
testitems,
(2)
the decrease in the amount of information is not as radical as 0departs
from the modalpoint,
and(3)
thediscrepancies
between the square root of theoriginal
TIF and those of the modified formulas are not asconspicuous
(Figure
4 versusFigure 3).
However,
the two curves for the modified formulas are almostoverlapping,
as was observed with the 30equivalent
test items(Figure 3).
Tables 1 and 5 showed
that,
although
all threepredictions
ofr(6&dquo;6,)
were accurate when the distribu-tion of 0 was in the interval of 0 in which the amount of test information waslarge,
when the 0distribu-tion shifted away from this interval
Equation
35 orEquation
36,
in which1(0)
inEquation
21 isreplaced
by I(8)
orE(0), respectively,
were betterpredictors
thanEquation
21. Thus the MLE bias function is useful inpredicting
r(ol, 6,).
Considering
that the values in Table 5 are upper bounds of thereliability
coeffi--cients,
Equation
36 in whichE(0) is
used will be the mostappropriate
formula.Thus the results of the
present
research suggest that it is advisable to useE(9)
rather than1(0)
forpredicting
thereliability
coefficient of a test attributed to aspecific
distribution of8,
as well as thecrite-rion in the
stopping
rule of a CAT. With a finite number of test items the MLE Of 0 isconditionally biased,
given
0.Therefore,
intheory
the useof E(0),
which is based on the minimum bound of the meansquared
error ofQy
rather than the minimal variance bound(Samejima, 1990),
is more reasonable than that ofT(9)
also,
although
the two resultsprovided
similar values here.These
examples
were selectedintentionally
to make the differences among the different 0 distribu-tions and among the threepredicted
~~®,9 ~Z~s
for each 0 distributionsubstantially large, using
equiva-lent test items. Becauseequivalent
test items are seldom used in actual tests, the differences betweenthe
resulting
predicted
reliability
coefficients obtainedby
using 1(0)
andby using
eitherT(9)
orE(6)
arc
expected
to be less.Because of more useful and informative measures like the TIF and its two modified
formulas,
thereliability
coefficient of a test is nolonger important
in modern testtheory.
It isinteresting,
however,
topredict
thereliability
coefficient-which is attributed to separate groups ofexaminees-using
the Tips. The traditionalconcept
of testreliability
ismisleading,
because thereliability
coefficient of the same test canbe
drastically
different for different groups of examinees.If the
reliability
coefficient must beused,
apractical suggestion
is tocompute
thepredicted
Figure
4Square
Rootsof lee) (Solid
Line),T(0) (Dashed Line),
and 2(e)(Dotted Line)
for Two TestsFollowing
theLogistic
Modelwhich the test is
likely
to be administered. In this way, one of thereliability
coefficients can beselected,
depending
on the group from which the examineesbeing
tested have beensampled.
This willimprove
the current use of thereliability
coefficient in which asingle
value is recorded in the test manual as thereliability
coefficient and is used with no consideration of thepopulation
from which thesample
of examinees have been selected.References
Birnbaum, A.
(1968).
Some latent trait models and theiruse in
inferring
an examinee’sability.
In F. M. Lord & M. R. Novick, Statistical theoriesof mental
testscores
(pp. 397-479). Reading
MA:Addison-Wesley.
Elderton,
W. P., &Johnson,
N. L.(1969). Systems of
fre-quency curves. London:Cambridge University
Press.Lawley,
D. N.(1943).
Onproblems
connected with itemselection and test construction.
Proceedings of the
Royal Society of Edinburgh,
61A, 273-287.Lord, F. M.
(1952).
Atheory
of testscores.
Psychometric
Monograph,
No. 7.Lord, F. M.
(1983).
Unbiased estimators ofability
pa-rameters, of their variance, and of their
parallel-fonns
reliability. Psychometrika, 48,
233-245.Lord, F. M.
(1984, April). Technical problems
arising inparameter estimation.
Paper presented
at the annualmeeting
of the American Educational ResearchAsso-ciation, New Orleans.
Lord, F. M., & Novick, M. R.
(1968).
Statistical theoriesof mental
test scores.Reading
MA:Addison-Wesley.
Samejima,
F.(1969).
Estimation ofability using
are-sponse pattern of
graded
scores.Psychometrika
Mono-graph,
No. 17.Samejima,
F.(1972).
Ageneral
model forfree-response
data.Psychometrika Monograph,
No. 18.Samejima,
F.(1977a).
Effects of individualoptimization
in
setting
boundaries of dichotomous items onaccu-racy of estimation.
Applied Psychological
Measure-ment, 1, 77-94.Samejima,
F.(1977b).
A use of the information function in tailoredtesting. Applied Psychological
Measure-ment, 1, 233-247.Samejima,
F.(1979). Convergence of the
conditionaldis-tribiatton
of
the maximum likelihood estimate, givenlatent trait, to the asymptotic
normality:
Observations madethrough
the constcantinformation
model(Office
of Naval Research
Rep.
No.79-3).
Knoxville:Depart-ment of
Psychology, University
of Tennessee.Samejima,
F.(1981). Efficient
methodsof estimating
the operating characteristicsof item
response categories andchallenge to
a newmodelfor
themultiple-choice
item(Office
of Naval Research Final Report ofN00014-77-C-0360).
Knoxville :Department
ofPsy-chology, University
of Tennessee.Samejima,
F.(1987).
Bias function ofthe
maximum like-lihood estimateof ability for
discrete item responses(Office
of Naval ResearchReport
No.87-1).
Knox-ville:Department
ofPsychology, University
of Ten-nessee.Samejima,
F.(1990). Modifications of the
testinforma-tion
function (Office
of Naval ResearchReport
No. 90-1). Knoxville: Department ofPsychology,
Univer-sity
of Tennessee.Samejima,
F. (1993a).
Anapproximation
for the bias func-tion of the maximum likelihood estimate of a latent variable for thegeneral
case where the item responsesare discrete.
Psychonietrika,
58, 119-138.Samejima,
F. (1993b).
The bias function of the maximum likelihood estimate ofability
for the dichotomousre-sponse level.
Psychometrika,
58, 195-209.Samejima,
F.(1994).
Roles of Fisher type information in latent trait models. In H.Bozdogan (Ed.),
Proceed-ingsof the first US/Japan conference
onthe frontiers
of statistical modeling:
Aninformational approach
(pp. 347-378).
The Netherlands: Kluwer.Acknowledgments
This research was
supported by
the Office of NavalRe-search Contracts N00014-87IK-0320 and N00014-90-J-1456.
Author’s Address
Send requests for