Research Report
Application of the Stochastic EM Method to Latent Regression Models
Matthias von Davier
Sandip Sinharay
Matthias von Davier and Sandip Sinharay ETS, Princeton, NJ
ETS Research Reports provide preliminary and limited dissemination of ETS research prior to publication.
Abstract
The reporting methods used in large-scale assessments such as the National Assessment of Educational Progress (NAEP) rely on a latent regression model. The first component of the model consists of a p-scale item response theory (IRT) measurement model that defines the response probabilities on a set of cognitive items in p scales depending on a p-dimensional latent trait variable $\theta = (\theta_1, \ldots, \theta_p)$. In the second component, the conditional distribution of this latent trait variable $\theta$ is modeled by a multivariate, multiple regression on a set of predictor variables, which are usually based on student, school, and teacher variables in assessments such as NAEP.
In order to fit the latent regression model using the maximum (marginal) likelihood estimation technique, multivariate integrals have to be evaluated. In the computer program MGROUP used by ETS for fitting the latent regression model to data from NAEP and other sources, the integration is currently done either by numerical quadrature (for problems up to two dimensions) or by an approximation of the integral. CGROUP, the current operational version of the MGROUP program used in NAEP and other assessments since 1993, is based on Laplace approximation that may not provide fully satisfactory results, especially if the number of items per scale is small.
This paper examines the application of stochastic expectation-maximization (EM) methods (where an integral is approximated by an average over a random sample) to NAEP-like settings. We present a comparison of CGROUP with a promising implementation of the stochastic EM algorithm that utilizes importance sampling. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting multivariate latent regression models.
Acknowledgements
The authors thank John Mazzeo, Shelby Haberman, Alina von Davier, Neal Thomas, Andreas Oranje, Ying Jin, and Matthew Johnson for useful advice, Steve Isham for help with the data sets used in the analysis, and Kim Fryer for help with proofreading.
1 Introduction
The National Assessment of Educational Progress (NAEP), the only regularly administered and congressionally mandated national assessment program (see, e.g., Beaton & Zwick, 1992), is an ongoing survey of the academic achievement of school students in the United States in a number of subject areas such as reading, writing, and mathematics. It is administered by the National Center for Education Statistics (NCES), a part of the U.S. Department of Education, to a selected sample of students in Grades 4, 8, and 12. A document called the Nation’s Report Card reports the results of NAEP for a number of academic subjects on a regular basis and compares them with the previous assessments in each subject area. Academic achievement is described as the average student proficiency for all students in the U.S. and as the percentage of students attaining fixed levels of proficiency (the levels being defined by the National Assessment Governing Board as what students should know at each grade level) in different subjects. In addition to producing these numbers for the nation as a whole, NAEP reports the same results for different subpopulations (based on gender, race, school type, etc.) of the student population.
NAEP is prohibited by law from reporting results for individual students, schools, or school districts and is designed to obtain optimal estimates of subpopulation characteristics rather than those of individual performance. To assess national performance in a valid way, NAEP must sample a wide and diverse body of student knowledge. To avoid the burden involved with presenting each item to every examinee, NAEP selects students randomly from designated grade and age populations (first, a sample of schools is selected according to a detailed stratified sampling plan, as mentioned in, e.g., Beaton & Zwick, 1992, and then students are sampled within the schools). NAEP then administers one of many possible booklets of items to each student. This process is sometimes referred to as “matrix sampling of items.” For example, the 2000 NAEP mathematics assessment at Grade 4 contained 173 items split across 26 booklets. Each item was developed to measure one of five subscales: (a) Number and Operations, (b) Measurements, (c) Geometry, (d) Data Analysis, and (e) Algebra. An item can be multiple-choice or constructed-response. Background (demographic) information is collected on the students through questionnaires that are filled out by students, teachers, and school administrators. For example, the questionnaires collected information on 381 background variables for each student in the assessment. The above description clearly shows that NAEP’s design and implementation are fundamentally different from those of a large-scale testing program.
NAEP reports were originally envisaged as simple lists of percents correct to individual survey items (Mislevy, Johnson, & Muraki, 1992), in the population as a whole and in subpopulations of particular interest. However, it was realized later that this approach has severe limitations (e.g., comparison is limited to groups of items common to student subpopulations), and major features of the detailed results from hundreds of items and hundreds of background variables could not be effectively communicated without some kind of statistical modeling. As a result, starting in 1984, NAEP reporting methods used a statistical model consisting of two components: (a) an item response theory (IRT) model, and (b) a linear regression model (see, e.g., Beaton, 1987; Mislevy et al., 1992). Other large scale educational assessments such as the International Adult Literacy Study (IALS; Kirsch, 2001), Trends in Mathematics and Science Study (TIMSS; Martin & Kelly, 1996), and Progress in International Reading Literacy Study (PIRLS; Mullis, Martin, Gonzalez, & Kennedy, 2003) also adopted essentially the same model.
This model is referred to as either the conditioning model, multilevel IRT model, or latent regression model. An algorithm for estimating the parameters of this model is implemented in the MGROUP set of programs, an ETS product. The first component of the model (the IRT part) defines the responses of examinees on a set of cognitive items to be dependent on a p-dimensional latent trait/proficiency vector $\theta = (\theta_1, \ldots, \theta_p)$. In the second component (linear regression part), the distribution of the proficiency variable $\theta$ is modeled by a multivariate, multiple regression on a set of background/predictor variables, which contain information on the respective students, schools, teachers, etc.
In order to compute maximum (marginal) likelihood estimates of the parameters of
this latent regression model, the proficiency θ (usually multivariate) has to be integrated
out; this makes estimation for the model problematic. Mislevy (1984, 1985) shows that the maximum likelihood estimates can be obtained using an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). The algorithm requires the values of the
posterior means and the posterior variances for the examinee proficiency parameters. For problems up to two dimensions, the integration is computed using numerical quadrature by the BGROUP version (Beaton, 1987) of the MGROUP program in operational settings in NAEP. For higher dimensions, no numerical integration routine is available operationally (although it may now be possible to perform numerical integration for higher dimensions; work on that is in progress), and an approximation of the integral is used. The CGROUP version of MGROUP, the current operational procedure used in NAEP and other assessments since 1993, is based on the Laplace approximation (Kass & Steffey, 1989) that ignores the higher order derivatives of the examinee posterior distribution and may not provide accurate results, especially in higher dimensions (e.g., a graphical plot for a data example in Thomas, 1993, shows that CGROUP overestimates the high examinee posterior variances).
A number of extensions to the CGROUP version of MGROUP or proposals for alternative estimation methods have been suggested. The MCMC estimation approach of Johnson and Jenkins (in press) and Johnson (2002) focuses on implementing a fully Bayesian method for joint estimation of IRT and regression parameters. The direct estimation for normal populations (Cohen & Jiang, 1999) tries to replace the multiple regression by a model that is essentially equivalent to a multigroup IRT model with some additional restrictions on item parameters and group-specific variances. The YGROUP version (von Davier & Yu, 2003) of MGROUP implements seemingly unrelated regressions (SUR; Zellner, 1962) and offers generalized least squares estimation of the latent regression model (von Davier & Yon, 2004). None of these alternatives has been entirely satisfactory. The MCMC estimation approach could not be implemented for the operational model with multidimensional proficiencies and the full set of background variables because of the time-consuming nature of MCMC estimation. The direct estimation approach as proposed by Cohen and Jiang (1999) leads to biased estimates even in very small data examples with nonnormal marginal ability distributions (von Davier, 2003). The SUR approach to estimating multivariate regressions is identical to generalized least squares (GLS) if all equations have the same regressors and observed dependent variables (Greene, 2002), a result that cannot be expected to hold in latent regressions. Therefore, there is scope for further research in this area.
Stochastic EM methods (SEM; Broniatowski, Celeux, & Diebolt, 1983; Celeux & Diebolt, 1985) have been suggested for use in EM algorithms where the E-step is hard to compute or approximate analytically or numerically. In an application of a stochastic EM algorithm, the E-step is handled by simulation. Depending on the nature of the simulation used in the E-step, a stochastic EM method can be a Monte Carlo EM (e.g., Wei & Tanner, 1990), Markov chain Monte Carlo EM (e.g., McCulloch, 1994), rejection sampling EM (e.g., Booth & Hobert, 1999), importance sampling EM (e.g., Booth & Hobert, 1999), etc. Important applications of the algorithm in psychometrics include Fox (2003), Lee, Song, and Lee (2003), Clarkson and Gonzalez (2001), and Meng and Schilling (1996), each of which employs the Markov chain Monte Carlo algorithm (e.g., Gilks, Richardson, & Spiegelhalter, 1996) to simulate in the E-step.
The importance sampling EM (e.g., Booth & Hobert, 1999) is a stochastic EM method where each integral in the E-step is approximated by an average over a random sample.
This paper examines the prospect of application of the importance sampling EM method to fit the latent regression model in NAEP-like settings. We also present a comparison of the results from the importance sampling EM algorithm with those from the CGROUP version of MGROUP, which is the technique currently used in NAEP. For a low dimensional real data example, the results from the suggested method are also compared to the BGROUP version of MGROUP, which performs exact numerical integration and is the gold standard method in such settings. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting latent regression models.
2 NAEP Model, Estimation, and the Current MGROUP Method
This section first states the two component statistical model used in NAEP. A brief description of the EM algorithm, which will be helpful later, follows. The next subsection describes the current NAEP estimation process and the MGROUP program used in ETS.
2.0.1 The Latent Regression Model
NAEP and other similar large-scale assessments implement a latent regression model utilizing an IRT measurement model. Assume that the unique p-dimensional latent proficiency vector for examinee i is $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{ip})'$. In NAEP, p could be 2, 3, or 5.
Let us denote the response vector to the test items for examinee i as $y_i = (y_{i1}, y_{i2}, \ldots, y_{ip})$, where $y_{ik}$, a vector of responses, contributes information about $\theta_{ik}$. The likelihood for an examinee is given by
$$f(y_i \mid \theta_i) = \prod_{q=1}^{p} f_1(y_{iq} \mid \theta_{iq}) \equiv L(\theta_i; y_i). \tag{1}$$
The terms $f_1(y_{iq} \mid \theta_{iq})$ above follow a univariate IRT model, usually with three-parameter logistic (3PL) or generalized partial-credit model (GPCM) likelihood terms. For reasons to be discussed later, the dependence of (1) on the item parameters is suppressed.
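To make the product structure of (1) concrete, the following is a minimal sketch that evaluates the log of L(θ; y) for one examinee when every item is dichotomous and follows a 3PL model. The 3PL response function used here (logistic metric, without a scaling constant), the function names, and the data layout are illustrative assumptions and are not taken from the MGROUP programs; GPCM items are not handled.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response (assumed logistic metric, no scaling constant)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def log_likelihood(theta, responses, item_params):
    """log L(theta; y) = sum over subscales q of log f1(y_q | theta_q), as in (1).

    theta       : sequence of p proficiencies, one per subscale
    responses   : list of p arrays of 0/1 item scores, one array per subscale
    item_params : list of p tuples (a, b, c), each an array over that subscale's items
    """
    total = 0.0
    for theta_q, y_q, (a, b, c) in zip(theta, responses, item_params):
        prob = p_correct_3pl(theta_q, a, b, c)
        # Bernoulli log-likelihood of the observed scores on subscale q
        total += np.sum(y_q * np.log(prob) + (1 - y_q) * np.log1p(-prob))
    return float(total)
```

Because each subscale contributes its own factor, the item parameters enter only through `item_params`; holding them fixed corresponds to suppressing them in the notation.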
Suppose $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ are m fully measured demographic and educational characteristics for the examinee. Conditional on $x_i$, the examinee proficiency vector $\theta_i$ is assumed to follow a multivariate normal prior distribution, that is, $\theta_i \mid x_i \sim N(\Gamma' x_i, \Sigma)$. The mean parameter matrix $\Gamma$ and the nonnegative definite variance matrix $\Sigma$ are assumed to be the same for all examinee groups.
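Under this setup, $L(\Gamma, \Sigma \mid Y, X)$, the (marginal) likelihood function for $(\Gamma, \Sigma)$ based on the data $(X, Y)$, is given by
$$L(\Gamma, \Sigma \mid Y, X) = \prod_{i=1}^{n} \int f_1(y_{i1} \mid \theta_{i1}) \cdots f_1(y_{ip} \mid \theta_{ip}) \, \phi(\theta_i \mid \Gamma' x_i, \Sigma) \, d\theta_i, \tag{2}$$
where n is the number of examinees, and $\phi(\cdot \mid \cdot, \cdot)$ is the multivariate normal density function.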
2.0.2 EM Algorithm
The EM algorithm (e.g., Dempster et al., 1977) is commonly used to obtain maximum likelihood estimates or posterior modes in problems with missing data. In fact, the algorithm is applicable in a broad range of situations where probability models can be reexpressed as ones on augmented parameter spaces and the added parameters can be thought of as missing data. The EM algorithm is relevant for mixed models because
the random effects may be viewed as missing data and the algorithm may be used to find maximum likelihood estimates for these models.
Suppose that given an unknown parameter vector $\omega$, data $y$ follows a distribution $f(y; \omega)$ and that
$$y' = (y'_{\text{obs}}, y'_{\text{mis}}),$$
where $y_{\text{obs}}$ is the observed data and $y_{\text{mis}}$ is the missing data. Suppose the objective is to obtain the maximum likelihood estimate of $\omega$, i.e., the value of $\omega$ that maximizes the observed data likelihood
$$f(y_{\text{obs}} \mid \omega) = \int f(y_{\text{obs}}, y_{\text{mis}} \mid \omega) \, dy_{\text{mis}}.$$
The EM algorithm goes through a succession of E-steps and M-steps starting from an initial value $\omega^{(0)}$ of the parameter vector $\omega$. In the (t + 1)-th E-step, one computes
$$Q(\omega \mid \omega^{(t)}) = E[\log\{f(y_{\text{obs}}, y_{\text{mis}} \mid \omega)\} \mid y_{\text{obs}}, \omega^{(t)}], \tag{3}$$
the expected value of the complete data loglikelihood, where the expectation is with respect to the distribution of $y_{\text{mis}}$ given $y_{\text{obs}}$ and $\omega^{(t)}$. In the corresponding M-step, one maximizes $Q(\omega \mid \omega^{(t)})$ computed in the preceding E-step with respect to $\omega$ to get $\omega^{(t+1)}$. The algorithm is said to have converged when two successive iterations give very similar values of the parameter vector. It is also important to monitor the value of the loglikelihood $\log f(y_{\text{obs}} \mid \omega)$.
The EM algorithm is guaranteed to converge only to a local maximum, so it is customary to start the iterations at many points in the parameter space and then to compare the values of the likelihood at all of the local maxima. The algorithm may be very slow to converge if the proportion of missing data is high.
For simple problems, one may be able to do the averaging in (3) analytically. However, in many practical problems (e.g., in NAEP estimation), the expectation cannot be computed analytically; then one may need to use an approximation technique or a simulation technique to compute the expectation.
2.0.3 NAEP Estimation Process and the MGROUP Program
NAEP uses a three-stage estimation process for fitting the above-mentioned model. The first stage, scaling, fits an IRT model (consisting of 3PL and GPCM terms) of the form in (1) to the examinee response data and estimates the item parameters. The prior distribution used in this step is not $\theta_i \mid x_i \sim N(\Gamma' x_i, \Sigma)$ as described above, but is $\theta_i \sim N(0, I)$, that is, the subscales are assumed to be independent a priori. The second stage, conditioning, assumes that the item parameters are fixed at the estimates found in scaling and fits the model in (2) to the data, that is, estimates $\Gamma$ and $\Sigma$ as a first part. In the second part of the conditioning step, plausible values for all examinees are obtained using the parameter estimates obtained in scaling and the first part of conditioning; the plausible values are used to estimate examinee subgroup averages. The third stage of the NAEP estimation process, called variance estimation, estimates the variances corresponding to the examinee subgroup averages using a jackknife approach (see, e.g., Johnson & Jenkins, in press). Our research will focus on the conditioning step and assume that the scaling has already been done (i.e., the item parameters are fixed); this is the reason we suppress the dependence of (1) on the item parameters.
Because we will be concerned with the conditioning step, the remaining part of the section provides a more detailed discussion of it. The first objective of this step is to estimate $\Gamma$ and $\Sigma$ from the data. If the $\theta_i$s were known, the maximum likelihood estimators of $\Gamma$ and $\Sigma$ would be
$$\hat{\Gamma} = (X'X)^{-1} X' \begin{pmatrix} \theta_1' \\ \theta_2' \\ \vdots \\ \theta_n' \end{pmatrix}, \tag{4}$$
$$\hat{\Sigma} = \frac{1}{n} \sum_i (\theta_i - \Gamma' x_i)(\theta_i - \Gamma' x_i)'. \tag{5}$$
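However, the $\theta_i$s are actually unknown. Mislevy (1984, 1985) shows that the maximum likelihood estimates of $\Gamma$ and $\Sigma$ under unknown $\theta_i$s can be obtained using an EM algorithm (Dempster et al., 1977). The EM algorithm iterates through a number of expectation steps (E-steps) and maximization steps (M-steps). The expression for $(\Gamma_{t+1}, \Sigma_{t+1})$, the updated value of the parameters in the tth M-step, is obtained as:
$$\Gamma_{t+1} = (X'X)^{-1} X' \begin{pmatrix} \tilde{\theta}_{1t}' \\ \tilde{\theta}_{2t}' \\ \vdots \\ \tilde{\theta}_{nt}' \end{pmatrix}, \tag{6}$$
$$\Sigma_{t+1} = \frac{1}{n} \left[ \sum_i \operatorname{Var}(\theta_i \mid X, Y, \Gamma_t, \Sigma_t) + \sum_i (\tilde{\theta}_{it} - \Gamma_{t+1}' x_i)(\tilde{\theta}_{it} - \Gamma_{t+1}' x_i)' \right], \tag{7}$$
where $\tilde{\theta}_{it} = E(\theta_i \mid X, Y, \Gamma_t, \Sigma_t)$ is the posterior mean for examinee i given the preliminary parameter estimates of iteration t. The process is repeated until convergence of the estimates $\Gamma$ and $\Sigma$.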
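The M-step updates (6) and (7) amount to a least-squares regression of the posterior means on the background variables, followed by an average of posterior variances and residual outer products. A minimal sketch, assuming the E-step has already supplied the posterior means and covariance matrices as arrays (the function name and array layout are illustrative and not part of MGROUP):

```python
import numpy as np

def m_step(X, post_means, post_vars):
    """One M-step update of (Gamma, Sigma) following (6) and (7).

    X          : (n, m) matrix of background variables, one row x_i' per examinee
    post_means : (n, p) matrix whose rows are the posterior means for iteration t
    post_vars  : (n, p, p) array of posterior covariance matrices for iteration t
    """
    n = X.shape[0]
    # (6): least-squares solution, equal to (X'X)^{-1} X' post_means
    # when X has full column rank; gamma has shape (m, p)
    gamma, *_ = np.linalg.lstsq(X, post_means, rcond=None)
    # residuals of the posterior means around the updated regression predictions
    resid = post_means - X @ gamma
    # (7): summed posterior covariances plus summed residual outer products, over n
    sigma = (post_vars.sum(axis=0) + resid.T @ resid) / n
    return gamma, sigma
```

Using a least-squares solver rather than forming the inverse of X'X explicitly is a numerical convenience chosen for this sketch; the two agree whenever X'X is invertible.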
Formulae (6) and (7) require the values of the posterior means E(θi|X,Y,Γt,Σt)
and the posterior variances Var(θi|X,Y,Γt,Σt) for the examinees. Correspondingly, the
tth E-step computes the two required quantities for all the examinees. The MGROUP set of programs at ETS performs the EM algorithm mentioned above.
The MGROUP program consists of two primary controlling routines called PHASE1 and PHASE2. The former does some preliminary processing while the latter directs the EM iterations. There are different versions of the MGROUP program depending on the method used to perform the E-step in PHASE2: BGROUP using numerical quadrature, NGROUP (Mislevy, 1985) using Bayesian normal theory, CGROUP (Thomas, 1993) using Laplace approximations, YGROUP (von Davier & Yu, 2003) using SUR, and so on.
The BGROUP version of MGROUP, applied when the dimension of θi is less
than or equal to two, applies numerical quadrature to approximate the integral. When
the dimension of θi is larger than two, NGROUP, CGROUP, or YGROUP may be used.
Note that NGROUP and YGROUP are not used for operational purposes for the following reasons: NGROUP assumes normality of both the likelihood and the prior and may produce biased results if these assumptions are not met (Thomas, 1993). YGROUP performs the estimation independently by implementing seemingly unrelated regressions (Zellner, 1962), which is useful for generating starting values, but does not capitalize on the covariances
between the p components of the latent trait. This may be inappropriate, for example, if
or all subjects.
The following subsection provides details about CGROUP, the approach currently used operationally in several large scale assessments including NAEP. This approach is based on approximation of posterior moments of a multivariate distribution using Laplace approximation.
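CGROUP E-step. The CGROUP version of MGROUP uses the Laplace approximation of the posterior mean and variance, that is, uses
$$E(\theta_j \mid X, Y, \Gamma_t, \Sigma_t) \approx \hat{\theta}_{j,\text{mode}} - \frac{1}{2} \sum_{r=1}^{p} G_{jr} G_{rr} \hat{h}_r^{(3)}, \quad j = 1, 2, \ldots, p, \tag{8}$$
$$\begin{aligned} \operatorname{Cov}(\theta_j, \theta_k \mid X, Y, \Gamma_t, \Sigma_t) \approx{} & G_{jk} - \frac{1}{2} \sum_{r=1}^{p} (G_{jr} G_{kr} G_{rr}) \hat{h}_r^{(4)} + \frac{1}{2} \sum_{r=1}^{p} \sum_{s=1}^{r} \left[ 1 - \frac{1}{2} I(r = s) \right] \hat{h}_r^{(3)} \hat{h}_s^{(3)} G_{rs} \\ & \times (G_{js} G_{ks} G_{rr} + G_{js} G_{kr} G_{rs} + G_{jr} G_{ks} G_{rs} + G_{jr} G_{kr} G_{ss}), \quad j, k = 1, 2, \ldots, p, \end{aligned} \tag{9}$$
where $\hat{\theta}_{j,\text{mode}}$ is the jth component of the posterior mode of $\theta$,
$$h = -\log[f(y \mid \theta) \, \phi(\theta \mid \Gamma_t' x, \Sigma_t)], \qquad G_{jk} = \left[ \left( \left. \frac{\partial^2 h}{\partial \theta_r \partial \theta_s} \right|_{\hat{\theta}_{\text{mode}}} \right)^{-1} \right]_{jk}, \tag{10}$$
$\hat{h}_r^{(n)}$ is the nth pure partial derivative of h with respect to $\theta_r$, evaluated at $\hat{\theta}_{\text{mode}}$, and $I(r = s)$ is 1 if r = s and 0 otherwise.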
Details about the formulae (8) and (9) can be found in Thomas (1993, pp. 316-317). The Laplace method does not provide an unbiased estimate of the quantity it is approximating and may provide inaccurate results if higher order derivatives (that the Laplace method assumes to be equal to zero) are not negligible. The error of approximation for each component of the mean and covariance of $\theta_i$ is of order $O(1/k^2)$ (e.g., Kass & Steffey, 1989), where k is the number of items measuring the skill corresponding to the component. Because the number of items administered to each examinee in NAEP is not too large (making k rather small), the errors in the Laplace approximation may become nonnegligible, especially for high-dimensional $\theta_i$s. Further, if the posterior distribution of the $\theta_i$s is multimodal (which is not impossible, especially for a small number of items), the method can perform poorly. Therefore the CGROUP version of MGROUP is not entirely satisfactory. Figure 1 in Thomas (1993), where the posterior variance estimates of 500 randomly selected examinees using the Laplace method and exact numerical integration for two-dimensional $\theta_i$ are plotted, shows that the Laplace method provides inflated variance estimates for examinees with large posterior variance (see also Section 6 in this report). The departure may be more severe for $\theta_i$s in higher dimensions.
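As an aid to reading (8) and (9), it may help to write out the single-scale case p = 1, in which G is a scalar and all sums collapse to one term. The expressions below are obtained only by direct substitution into (8) and (9); they are not an additional result from Thomas (1993).

```latex
\begin{align*}
E(\theta \mid X, Y, \Gamma_t, \Sigma_t)
  &\approx \hat{\theta}_{\text{mode}} - \tfrac{1}{2}\, G^{2}\, \hat{h}^{(3)}, \\
\operatorname{Var}(\theta \mid X, Y, \Gamma_t, \Sigma_t)
  &\approx G - \tfrac{1}{2}\, G^{3}\, \hat{h}^{(4)} + G^{4} \bigl(\hat{h}^{(3)}\bigr)^{2},
  \qquad
  G = \left( \left. \frac{d^{2} h}{d\theta^{2}} \right|_{\hat{\theta}_{\text{mode}}} \right)^{-1}.
\end{align*}
```

When the third and fourth derivatives of h vanish at the mode, as for an exactly normal posterior, both corrections disappear and the approximation returns the mode and G themselves.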
3 The Suggested Stochastic EM Method
Before going into the details of the suggested technique, we provide a brief description of the importance sampling method (e.g., Geweke, 1989).
3.1 Importance Sampling
The Radon-Nikodym theorem (e.g., Bauer, 1972) forms the basis for importance sampling. This theorem states that a measure M over Ω can be expressed in terms of another measure N over Ω,
$$M(A) = \int_A f \, dN \;\leftrightarrow\; \int_A m(\omega) \, d\omega = \int_A n(\omega) f(\omega) \, d\omega,$$
if $N(A) = 0 \rightarrow M(A) = 0$ for all A. The function f is referred to as the Radon-Nikodym derivative and is sometimes written as $f = \partial M / \partial N$. In the case that m and n are probability densities over $\Omega = \mathbb{R}^p$, the theorem implies
$$\int_\Omega f(\omega) \, \omega \, n(\omega) \, d\omega = \int_\Omega \omega \, m(\omega) \, d\omega,$$
which is equivalent to $E_M(\omega) = E_N(f\omega)$.
This result is applied in importance sampling to approximate intractable integrals.
Suppose we want to compute the expected value of a function $g(\omega)$ where the expectation is taken with respect to a probability density $f(\omega)$. We can express the expectation of $g(\omega)$ as
$$E_f\{g(\omega)\} = \int_\omega g(\omega) f(\omega) \, d\omega = \int_\omega \frac{g(\omega) f(\omega)}{h(\omega)} \, h(\omega) \, d\omega \tag{11}$$
for any probability density $h(\omega)$ defined on the same sample space that is zero only if $f(\omega)$ is zero. Now, if we can generate a random sample $\tilde{\omega}_1, \tilde{\omega}_2, \ldots, \tilde{\omega}_n$ from the distribution with density $h(\omega)$, we can approximate $E_f\{g(\omega)\}$, using (11), as
$$E_f\{g(\omega)\} \approx \frac{1}{n} \sum_{i=1}^{n} \frac{g(\tilde{\omega}_i) f(\tilde{\omega}_i)}{h(\tilde{\omega}_i)}. \tag{12}$$
The standard error corresponding to the above estimate is given by dividing the standard deviation of the importance ratios $g(\tilde{\omega}_i) f(\tilde{\omega}_i) / h(\tilde{\omega}_i)$ by $\sqrt{n}$. The standard deviation mentioned above determines the quality of the approximation. The density $h(\omega)$ is called the importance sampling density.
If $h(\omega)$ is chosen so that the importance ratio $g(\omega) f(\omega) / h(\omega)$ is roughly constant over the range of possible values of $\omega$ (which will make the standard deviation of the importance ratios small), a fairly accurate approximation of the integral may be obtained. In particular, $h(\omega)$ should have heavier tails than the product $g(\omega) f(\omega)$ since otherwise there may be a few $\omega_i$’s for which $h(\omega)$ will be much smaller than $g(\omega) f(\omega)$ and the estimate will blow up. The Radon-Nikodym theorem requires $\int_A h = 0 \rightarrow \int_A g = 0$ to hold, which can be stated in an approximate fashion as: Wherever h is zero, g needs to be zero. This is the theoretical justification for the heavier tail requirement of h over gf.
The main advantages of importance sampling are that it provides an unbiased estimate of the integral and the accuracy can be monitored by examining the standard deviation of the $g(\tilde{\omega}_i) f(\tilde{\omega}_i) / h(\tilde{\omega}_i)$ values.
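The estimate (12) and its standard error can be illustrated with a small one-dimensional sketch. Here the target density f is standard normal, g(ω) = ω², so the true expectation is 1, and the importance sampling density h is a Student t density with four degrees of freedom. These choices, the seed, the sample size, and all function names are assumptions made for illustration only.

```python
import numpy as np

def importance_estimate(g, f_pdf, h_pdf, draws):
    """Estimate E_f[g] as in (12); return (estimate, standard error)."""
    ratios = g(draws) * f_pdf(draws) / h_pdf(draws)   # importance ratios g*f/h
    return ratios.mean(), ratios.std(ddof=1) / np.sqrt(len(draws))

def normal_pdf(w):
    """Standard normal density, playing the role of f."""
    return np.exp(-0.5 * w**2) / np.sqrt(2.0 * np.pi)

def t4_pdf(w):
    """Student t density with 4 degrees of freedom: (3/8) * (1 + w^2/4)^(-5/2)."""
    return 0.375 * (1.0 + w**2 / 4.0) ** -2.5

rng = np.random.default_rng(1)
draws = rng.standard_t(df=4, size=20000)              # sample from h
estimate, std_err = importance_estimate(lambda w: w**2, normal_pdf, t4_pdf, draws)
```

Because the t density has heavier tails than the normal density, the importance ratios stay bounded in the tails, which is the behavior the heavier-tail requirement is meant to secure.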
In some situations, the researcher knows a distribution up to a constant only, but still needs to compute the moments of the distribution. For example, often all that is available in a Bayesian analysis is a multiple of the posterior (obtained by multiplying the likelihood and the prior), but not the posterior itself. Importance sampling can be used in this situation as well. Suppose, in the above setup, one does not know $f(\omega)$, but does know $q(\omega) \equiv c \cdot f(\omega)$. The expectation of a function $g(\omega)$ is given by
$$E_f\{g(\omega)\} = \frac{\int_\omega g(\omega) q(\omega) \, d\omega}{\int_\omega q(\omega) \, d\omega}.$$
The numerator on the right-hand side of the above is approximated by
$$\int_\omega g(\omega) q(\omega) \, d\omega = \int_\omega \frac{g(\omega) q(\omega)}{h(\omega)} \, h(\omega) \, d\omega \approx \frac{1}{n} \sum_{i=1}^{n} \frac{g(\tilde{\omega}_i) q(\tilde{\omega}_i)}{h(\tilde{\omega}_i)},$$
while the denominator is approximated by $\frac{1}{n} \sum_{i=1}^{n} q(\tilde{\omega}_i) / h(\tilde{\omega}_i)$ (see, e.g., Gelman, Carlin, Stern, & Rubin, 2003, pp. 342-343). Geweke (1989) shows that the estimate above converges almost surely to the estimand under weak assumptions and a central limit theorem (establishing that $n^{1/2}(\widehat{E}_f\{g(\omega)\} - E_f\{g(\omega)\}) \rightarrow N(0, \tau^2)$, where $\tau^2$ can be estimated consistently) applies under stronger assumptions. The estimation of a ratio of two quantities by the ratio of their estimates is known as ratio estimation. For more details about ratio estimation, see, for example, Cochran (1977).
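The ratio estimate can be sketched in a few lines. In this illustration the unnormalized density q is the kernel of a normal distribution with mean 1 and standard deviation 0.5, with its normalizing constant deliberately omitted, and the importance sampling density is again a Student t with four degrees of freedom. The target, seed, and function names are assumptions chosen so that the true answer (a mean of 1) is known.

```python
import numpy as np

def ratio_estimate(g, q, h_pdf, draws):
    """Ratio (self-normalized) importance sampling: sum(g * w) / sum(w), w = q / h."""
    weights = q(draws) / h_pdf(draws)
    # the factor 1/n appears in numerator and denominator and cancels
    return np.sum(g(draws) * weights) / np.sum(weights)

def q_unnormalized(w):
    """Kernel of a N(1, 0.5^2) density; the constant c is left out on purpose."""
    return np.exp(-0.5 * ((w - 1.0) / 0.5) ** 2)

def t4_pdf(w):
    """Student t density with 4 degrees of freedom."""
    return 0.375 * (1.0 + w**2 / 4.0) ** -2.5

rng = np.random.default_rng(2)
draws = rng.standard_t(df=4, size=20000)
mean_hat = ratio_estimate(lambda w: w, q_unnormalized, t4_pdf, draws)
```

Multiplying `q_unnormalized` by any positive constant leaves `mean_hat` unchanged, which is why knowing the posterior only up to a constant is sufficient.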
3.2 Importance Sampling in the E-step of the MGROUP Program
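This paper suggests approximating the posterior expectation and variance of the examinee proficiencies $\theta_i$ in (6) and (7) using the importance sampling method.
The posterior distribution of $\theta_i$, denoted as $p(\theta_i \mid X, Y, \Gamma_t, \Sigma_t)$, is given by
$$p(\theta_i \mid X, Y, \Gamma_t, \Sigma_t) \propto f_1(y_{i1} \mid \theta_{i1}) \cdots f_1(y_{ip} \mid \theta_{ip}) \, \phi(\theta_i \mid \Gamma_t' x_i, \Sigma_t) \tag{13}$$
using (2). The proportionality constant in (13) is a function of $y_i$, $\Gamma_t$, and $\Sigma_t$. Let us denote
$$q(\theta_i \mid X, Y, \Gamma_t, \Sigma_t) \equiv f_1(y_{i1} \mid \theta_{i1}) \cdots f_1(y_{ip} \mid \theta_{ip}) \, \phi(\theta_i \mid \Gamma_t' x_i, \Sigma_t). \tag{14}$$
We drop the subscript i for convenience for the rest of the section and let $\theta$ denote the proficiency of an examinee.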
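We have to compute the mean and variance of $p(\theta \mid X, Y, \Gamma_t, \Sigma_t)$, that is, the quantities
$$E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \equiv \int \theta \, p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \, d\theta \tag{15}$$
and
$$\operatorname{Var}(\theta \mid X, Y, \Gamma_t, \Sigma_t) \equiv \int (\theta - E(\theta \mid X, Y, \Gamma_t, \Sigma_t))(\theta - E(\theta \mid X, Y, \Gamma_t, \Sigma_t))' \, p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \, d\theta. \tag{16}$$
Using ideas from the above description of the importance sampling method, if we can generate a random sample $\theta_1, \theta_2, \ldots, \theta_n$ from a distribution $h(\theta)$ approximating $p(\theta \mid X, Y, \Gamma_t, \Sigma_t)$ reasonably, we can approximate (15) by the ratio of
$$\frac{1}{n} \sum_{j=1}^{n} \frac{\theta_j \, q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)} \quad \text{and} \quad \frac{1}{n} \sum_{j=1}^{n} \frac{q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)}.$$
Similarly, we can approximate (16) as the ratio of
$$\frac{1}{n} \sum_{j=1}^{n} \frac{(\theta_j - E(\theta \mid X, Y, \Gamma_t, \Sigma_t))(\theta_j - E(\theta \mid X, Y, \Gamma_t, \Sigma_t))' \, q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)} \quad \text{and} \quad \frac{1}{n} \sum_{j=1}^{n} \frac{q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)},$$
where the unknown $E(\theta \mid X, Y, \Gamma_t, \Sigma_t)$ is replaced by its importance sampling approximation.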
Following Booth and Hobert (1999), who apply the importance sampling EM
method for fitting generalized linear mixed models, we use a multivariate t importance
sampling density. Booth and Hobert (1999) argue that in general, a multivariate t density
is a better choice as an importance sampling density than a normal density. One reason is that the former is heavy-tailed while the latter is not, and that a good importance sampling
density should be heavy-tailed when the density it approximates may be so; also, the t
density is easy to sample from. The degrees of freedom of the t density is taken to be four
because low degrees of freedom of a t density ensures that it has a heavy tail. (The choice
of optimal degrees of freedom is an issue requiring further investigation.)
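The E-step computation for one examinee can be sketched as follows: draw from a multivariate t density with four degrees of freedom, weight each draw by q/h, and form the ratio estimates of the posterior mean and covariance. The function names, the use of log weights for numerical stability, and the default of 6,000 draws are choices made for this sketch; the location and scale of the t density are taken as given inputs, and the code is not a description of the operational implementation.

```python
import numpy as np
from math import lgamma, log, pi

def mvt_sample(mean, cov, df, n, rng):
    """Draw n points from a multivariate t with location `mean` and scale matrix `cov`."""
    z = rng.multivariate_normal(np.zeros(len(mean)), cov, size=n)
    u = rng.chisquare(df, size=n)
    return mean + z * np.sqrt(df / u)[:, None]

def mvt_logpdf(x, mean, cov, df):
    """Log density of the multivariate t evaluated at the rows of x."""
    p = len(mean)
    diff = x - mean
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    const = lgamma((df + p) / 2) - lgamma(df / 2) - 0.5 * p * log(df * pi) - 0.5 * logdet
    return const - 0.5 * (df + p) * np.log1p(maha / df)

def posterior_moments(log_q, mean, cov, n_draws=6000, df=4, rng=None):
    """Ratio importance sampling estimates of the posterior mean and covariance.

    log_q maps an (n, p) array of draws to log q(theta | ...), known up to a constant.
    """
    rng = np.random.default_rng() if rng is None else rng
    draws = mvt_sample(mean, cov, df, n_draws, rng)
    log_w = log_q(draws) - mvt_logpdf(draws, mean, cov, df)
    w = np.exp(log_w - log_w.max())      # rescaling cancels in the ratio
    w /= w.sum()                         # normalized weights (q/h) / sum(q/h)
    post_mean = w @ draws
    centered = draws - post_mean         # plug in the estimated posterior mean
    post_cov = (centered * w[:, None]).T @ centered
    return post_mean, post_cov
```

In use, `log_q` would be the sum of the log likelihood terms and the log normal prior density from (14), evaluated at each draw.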
The mean and variance of the t importance sampling density are taken as input
from a nonstochastic version of MGROUP. In our implementation, the YGROUP version (von Davier & Yu, 2003) is used, which conveniently provides input for the importance sampling density in the stochastic EM algorithm.
It is also possible to estimate variances of the parameter estimates obtained from the EM algorithm as suggested by, for example, Booth and Hobert (1999). However, we are
more interested in the estimation of the latent regression parameters in this work and do not delve into the issue here.
Note that a number of studies have successfully applied other forms of stochastic EM methods, such as rejection sampling EM (Booth & Hobert, 1999) and Markov chain Monte Carlo EM, or MCMC-EM, (e.g., McCulloch, 1994), but we do not explore those approaches, mainly because of the difficulties in their application to our problem.
4 MCEM in MGROUP
This section describes the implementation of MCEMGROUP (MCEM is short for Monte Carlo EM), the version of MGROUP that uses importance sampling to generate the necessary statistics in the E-step. Because of the presence of a stochastic component in the E-step, this technique is also called a stochastic EM algorithm.
Following the notes on the implementation, differences between assessing
convergence in nonstochastic optimization and in stochastic optimization algorithms will be discussed.
4.1 Implementing MCEMGROUP
The MCEM algorithm was integrated into YGROUP (von Davier, 2003), which is based on BGROUP and implements SUR (Zellner, 1962) and various other modifications.
Following Booth and Hobert (1999), importance sampling is used in the E-step to generate the statistics (here, posterior means, residual components, and measurement error components) necessary to perform the EM iterations for the latent regression.
The implementation uses importance sampling, employing a multivariate t (as in
Booth & Hobert, 1999). The MCEMGROUP version as implemented in YGROUP uses the SUR posterior means and the posterior residual (co-)variance matrix as fast and efficient means to compute the mean and variance of the importance sampling density.
Like in all versions of MGROUP, optional starting values for the latent regression
(Γ,Σ) can be provided but are not essential. As an alternative, initial iterations with a
nonstochastic E-step of one of the other versions can be carried out to generate starting values. The initial sample size of the importance sample can be chosen to be much smaller
than the final sample size, in order to speed up initial iterations where individual accuracy is less important than approaching a good starting value for the aggregate parameters. The examples below use 500 as initial importance sample size and 6,000 as final importance sample size.
4.2 Determining Convergence of the MCEM Algorithm
When applying a stochastic EM algorithm, it is possible that the EM algorithm has not converged, but appears so after an EM step (i.e., updated parameters and likelihoods do not change much from the previous step) because of an unlucky random sample.
One should therefore use a stricter convergence criterion, or a larger variety of criteria that all have to be met simultaneously, in order to ensure the convergence of the algorithm. We monitor the likelihood function and the parameter vector for convergence and conclude that convergence has occurred only when the relative change in both these quantities is less than a preset bound in five successive iterations.
The likelihood increases over the iterations in a nonstochastic EM algorithm (see, e.g., Dempster et al., 1977) while it may not be so for a stochastic EM algorithm because of Monte Carlo error. In order to assess convergence in stochastic optimization algorithms, the inevitable nonmonotonicity of these algorithms has to be taken into account. Each E-step in our implementation employs a different set of points (the importance sample) in the multivariate quadrature grid in order to perform the integration. This means that the likelihood of each observed response vector may vary, and hence, the likelihood of the whole sample may vary even if all other parameters are unchanged, from one importance sample to the next.
Therefore, in our implementation, we monitor absolute changes in the log likelihood, the regression parameters Γ, and the residual (co-)variance matrix Σ simultaneously. The maximum absolute changes are checked against user-defined stopping criteria, which are used to decide whether a significant change has occurred that demands further iterations towards the optimum.
The algorithm stops if, and only if, all three absolute changes are smaller than user-defined bounds. In addition, the absolute changes of previous iterations are integrated with the current change by using the following approach: For cycle t, define the average absolute maximum change (AAMC) as

$$\mathrm{AAMC}_{x,t} = p \times \mathrm{AMC}_{x,t} + (1-p) \times \mathrm{AAMC}_{x,t-1},$$

which averages the current absolute maximum change (AMC) and the previous average absolute maximum change (AAMC).
Here x denotes a parameter (vector) or a function of parameters (for example, the maximum change in regression parameters in our case); t denotes the current iteration (cycle) of the algorithm; t−1, the previous cycle; and 0 ≤ (1−p) ≤ 1.0, the weight of the averaged criterion from the previous cycles.
This criterion ensures that, if p < 1, more than the current change (which might
be small due to the stochastic nature of the algorithm) is taken into account when deciding
whether to stop iterations. If p = 1, we have a regular (no memory) stopping criterion,
which stops whenever the current iteration alone meets the stopping rule.
In the current implementation, the stopping rule fades out the past absolute changes rather slowly. We found that p = 0.4 and an AAMC bound of 0.045 for relative change in likelihood, regression weights, and variances work reasonably well. In the examples presented here, this choice balances the risk of iterating indefinitely against the risk of terminating the algorithm prematurely. The rule for stopping the algorithm is that AAMC_{x,t} < 0.045 has to be satisfied for termination, a maximum change bound similar to what Booth and Hobert (1999) report.
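The stopping rule can be sketched in a few lines. This is an illustration only, not the MCEMGROUP code: the function names are hypothetical, and initialising the average with the first observed change is an assumption, since the report does not state how the recursion is started.

```python
def update_aamc(prev_aamc, amc, p=0.4):
    """One step of AAMC_t = p * AMC_t + (1 - p) * AAMC_{t-1}."""
    return p * amc + (1.0 - p) * prev_aamc


def first_stopping_cycle(amc_history, p=0.4, bound=0.045):
    """Return the 1-based cycle at which all monitored AAMC values fall below bound.

    amc_history: one dict per cycle, mapping a monitored quantity
    (e.g. "loglik", "gamma", "sigma") to its absolute maximum change.
    Returns None if the rule is never met.
    """
    aamc = None
    for cycle, amc in enumerate(amc_history, start=1):
        if aamc is None:
            aamc = dict(amc)  # assumption: start the average at the first AMC
        else:
            aamc = {key: update_aamc(aamc[key], amc[key], p) for key in amc}
        if all(value < bound for value in aamc.values()):
            return cycle
    return None
```

For a made-up sequence of changes 0.5, 0.3, 0.01, 0.2, 0.1, 0.05, 0.03, 0.02, 0.01, 0.01 in all three quantities, the memoryless rule (p = 1) stops at cycle 3 because of a single small change, whereas p = 0.4 keeps iterating until cycle 9.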
5 Analysis of Simulated Data
This section presents the first proof-of-concept example. The simulated data set used for this example matches the structure and size of the 2000 NAEP mathematics assessment at grade 4 and comes from a previous study (von Davier & Yu, 2003). The assessment has five scales: (a) Number and Operations, (b) Measurement, (c) Geometry, (d) Data Analysis, and (e) Algebra. This section applies the stochastic EM method to find the parameter estimates for the simulated data set, which involves responses of 13,511 students. Each student responds to 1 of 26 sets of items (booklets). Each booklet contains three blocks of 12 to 15 items (both multiple-choice and constructed-response) out of a total of 145 items. Each examinee has information on 381 predictors that are used in the latent regression model.
This section compares the estimates from MCEMGROUP and the operational CGROUP of the posterior means and standard deviations of examinees, the regression parameters Γ, and the residual variance-covariance matrix Σ.
As an additional benchmark, the results of both CGROUP and MCEMGROUP are compared against the generating values, that is, the individual latent trait vectors used to simulate the data.
5.1 Estimates of Γ and Σ for the Simulated Data
Figure 1 shows a comparison of the CGROUP and MCEMGROUP regression parameter estimates (i.e., estimates of the individual parameters in Γ) for the 2000 NAEP mathematics assessment at grade 4 by subscale. The figure shows that the estimates from the two approaches are extremely close.
Table 1 shows the residual variance-covariance estimates (i.e., estimates of the
individual parameters in Σ) from CGROUP and MCEMGROUP for the simulated data
set.
The table shows slightly larger estimated variances for the MCEM version of MGROUP as compared to CGROUP. The pattern is the opposite for the estimated covariances. The differences between the covariance estimates of CGROUP and MCEMGROUP seem smaller than the differences between the variance estimates of the two approaches.
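The pattern described here can be checked directly from the entries of Table 1. The short sketch below copies the upper triangles of the two matrices from the table and separates the differences (MCEMGROUP minus CGROUP) into variance and covariance differences; the function name is hypothetical.

```python
# Upper triangles (row-wise, diagonal first) copied from Table 1.
# Scale order: NUM&OPER, MEASURMT, GEOMETRY, DATA ANL, ALGEBRA.
CGROUP = [
    [0.32592, 0.30510, 0.32886, 0.29929, 0.32429],
    [0.32481, 0.31857, 0.28613, 0.31267],
    [0.37103, 0.30208, 0.33679],
    [0.32091, 0.30645],
    [0.35941],
]
MCEMGROUP = [
    [0.33872, 0.29779, 0.31766, 0.29167, 0.31602],
    [0.35134, 0.30435, 0.27761, 0.30007],
    [0.3883, 0.29045, 0.31976],
    [0.3552, 0.29600],
    [0.38106],
]


def split_differences(reference, other):
    """Return (variance differences, covariance differences), other minus reference."""
    var_diff, cov_diff = [], []
    for row_ref, row_other in zip(reference, other):
        for k, (a, b) in enumerate(zip(row_ref, row_other)):
            # The first entry of each row is the diagonal (variance) element.
            (var_diff if k == 0 else cov_diff).append(b - a)
    return var_diff, cov_diff


def mean_abs(values):
    """Mean absolute value of a list of numbers."""
    return sum(abs(v) for v in values) / len(values)
```

With these inputs, all five variance differences are positive, all ten covariance differences are negative, and the mean absolute covariance difference (about 0.011) is roughly half the mean absolute variance difference (about 0.023).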
5.2 Subgroup Estimates for the 2000 NAEP Mathematics Assessment at Grade 4
Figures 2 to 4 present recovery plots for CGROUP and MCEMGROUP. The figures also show subplots comparing the individual means and standard deviations from CGROUP and MCEMGROUP. Each subscale has four subplots. The recovery plots are based on individual examinee statistics and hence will show more variance than that in the population parameters Γ and Σ. In the recovery plots, the generating values are the reference, and the conditional posterior moments (given the responses, the background data, and the estimates of these parameters used when generating imputations with MGROUP) as generated by CGROUP and MCEMGROUP are plotted against these reference values. For all subscale displays, the ideal recovery or agreement line, respectively, is indicated by the main diagonal given in each plot.
[Figure 1 panels: All Gamma NUM&OP, All Gamma MEASUREMENT, All Gamma GEOMETRY, All Gamma DATA ANALYSIS, All Gamma ALGEBRA; axes: CGROUP and MCEMGROUP.]
Figure 1. Regression parameter (Γ) estimates from CGROUP and MCEMGROUP for the data for the 2000 NAEP mathematics assessment at grade 4.
Figures 2 to 4 show no obvious differences between CGROUP and MCEMGROUP when comparing the plots showing recovery of the generating θ values. This observation finds additional support when examining the plot of both posterior mean estimates. The posterior means generated by CGROUP and MCEMGROUP fall closely around the diagonal line, and the plot is much narrower than the “truth” recovery plots. This indicates that CGROUP and MCEMGROUP agree quite well when producing posterior means, which cannot be said of the posterior standard deviations.
[Figure 2 panels, for Num&Op and Measurement: posterior mean versus true theta for CGROUP; posterior mean versus true theta for MCEMGROUP; posterior mean, MCEMGROUP versus CGROUP; posterior SD, MCEMGROUP versus CGROUP.]
Figure 2. Posterior moments from CGROUP and MCEMGROUP compared against the generating values and the posterior means and standard deviations from CGROUP and MCEMGROUP compared against each other.
[Figure 3 panels, for Geometry and Data Analysis: posterior mean versus true theta for CGROUP; posterior mean versus true theta for MCEMGROUP; posterior mean, MCEMGROUP versus CGROUP; posterior SD, MCEMGROUP versus CGROUP.]
Figure 3. Posterior moments from CGROUP and MCEMGROUP compared against the generating values and the posterior means and standard deviations from CGROUP and MCEMGROUP compared against each other.
Table 1.
Residual Variance-Covariance Estimates From CGROUP and MCEMGROUP

                      NUM&OPER  MEASURMT  GEOMETRY  DATA ANL   ALGEBRA
CGROUP estimates
  NUM&OPER             0.32592   0.30510   0.32886   0.29929   0.32429
  MEASURMT                       0.32481   0.31857   0.28613   0.31267
  GEOMETRY                                 0.37103   0.30208   0.33679
  DATA ANL                                           0.32091   0.30645
  ALGEBRA                                                      0.35941
MCEMGROUP estimates
  NUM&OPER             0.33872   0.29779   0.31766   0.29167   0.31602
  MEASURMT                       0.35134   0.30435   0.27761   0.30007
  GEOMETRY                                  0.3883   0.29045   0.31976
  DATA ANL                                            0.3552   0.29600
  ALGEBRA                                                      0.38106

Note. Based on the operational population model for the 2000 NAEP mathematics assessment at grade 4.
The posterior standard deviations given by MCEMGROUP are considerably larger than for CGROUP for all subscales. It is to be determined whether this is a systematic difference, whether it is due to Monte Carlo inflation of the variance, or whether this depends on some other structural variable such as the correlation between scales or the sample size.
Table 2 compares the subgroup means from CGROUP and MCEMGROUP for relevant subgroups; there seems to be little difference between the two methods in this respect as well.
The results in Table 2 show that MCEMGROUP does not perform much differently from CGROUP when reproducing group-level statistics such as subgroup means and variances.
[Figure 4 panels, for Algebra: posterior mean versus true theta for CGROUP; posterior mean versus true theta for MCEMGROUP; posterior mean, MCEMGROUP versus CGROUP; posterior SD, MCEMGROUP versus CGROUP.]
Figure 4. Posterior moments from CGROUP and MCEMGROUP compared against the generating values and the posterior means and standard deviations from CGROUP and MCEMGROUP compared against each other.
6 Analysis of Real NAEP Data: 2003 Reading Assessment at Grade 4
The 2003 NAEP reading assessment at grade 4 has data from 187,581 examinees in the fourth grade or 9 years of age. Each student answers a literary block (to assess the ability to read for literary experience) and an informative block (to assess the ability to read for information). Each block consists of 10 to 12 items, with a mixture of multiple-choice and constructed-response items. We will refer to the two skills (subscales) measured by the assessment as literary and information. The combined item sample for these two scales has 111 items in total; 688 background variables are used in the latent regression model.
Because the θ_i are two-dimensional, it is possible to apply BGROUP (which performs exact numerical integration in the E-step) here. BGROUP provides the gold standard, and we will compare the results obtained from the application of the stochastic EM method against it. A comparison with the results from CGROUP will be provided as well.
Table 2.
Comparison of Subgroup Estimates From CGROUP and MCEMGROUP

           CGROUP                                  MCEMGROUP
Subgroup    NumOp    Meas    Geom      DA     Alg   NumOp    Meas    Geom      DA     Alg
Overall     0.033   0.006  –0.023   0.047   0.037   0.031   0.009  –0.013   0.060   0.039
Male        0.069   0.083  –0.043   0.061   0.089   0.066   0.086  –0.030   0.075   0.092
Female     –0.004  –0.072  –0.003   0.032  –0.016  –0.006  –0.071   0.004   0.045  –0.016
White       0.272   0.310   0.248   0.361   0.270   0.270   0.311   0.254   0.369   0.269
Black      –0.618  –0.838  –0.748  –0.794  –0.624  –0.617  –0.833  –0.738  –0.777  –0.623
Hispanic   –0.454  –0.566  –0.550  –0.525  –0.420  –0.459  –0.560  –0.527  –0.493  –0.411
Asian       0.605   0.388   0.532   0.369   0.680   0.605   0.392   0.538   0.382   0.688
Amerind    –0.407  –0.287  –0.693  –0.515  –0.429  –0.412  –0.271  –0.630  –0.154   0.014
6.1 Estimates of Γ and Σ for the Real Data
Table 3 shows the residual variance estimates Σ̂ as generated by BGROUP, MCEMGROUP, and CGROUP for the NAEP data.
Table 3.
Residual Variances, Covariances, and Correlations for the 2003 NAEP Reading Assessment at Grade 4 Data

          BGROUP               MCEMGROUP            CGROUP
          Lit.      Info.      Lit.      Info.      Lit.      Info.
Lit.      0.50601   0.41801    0.48479   0.41284    0.51547   0.41782
Info.     0.82130   0.51193    0.84360   0.49401    0.80241   0.52598

Note. Lit. = Literary, Info. = Information. Residual variances = main diagonals, covariances = upper off-diagonals, and correlations = lower off-diagonals.
The three estimates for the residual correlation and the residual variances are close. Interestingly, the residual variance estimates generated by MCEMGROUP are slightly smaller than the CGROUP estimates for the reading data, whereas the residual variance estimates for the simulated math data were larger for MCEMGROUP than for CGROUP. As a result, the correlation is higher for MCEMGROUP than for CGROUP. Whether this is an effect of the sample size or the number of scales needs further investigation.
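The correlations in Table 3 follow from the tabled variances and covariances as r = cov / sqrt(var_Lit × var_Info). The sketch below recomputes them from the values copied from the table; the names are hypothetical and nothing beyond this arithmetic is implied about how MGROUP reports the correlations.

```python
import math

# (variance Lit., covariance, variance Info., reported correlation) from Table 3.
TABLE_3 = {
    "BGROUP": (0.50601, 0.41801, 0.51193, 0.82130),
    "MCEMGROUP": (0.48479, 0.41284, 0.49401, 0.84360),
    "CGROUP": (0.51547, 0.41782, 0.52598, 0.80241),
}


def residual_correlation(var_lit, cov, var_info):
    """Correlation implied by two variances and their covariance."""
    return cov / math.sqrt(var_lit * var_info)


def recomputed_correlations(table=TABLE_3):
    """Map each program name to the correlation recomputed from its entries."""
    return {name: residual_correlation(v_lit, cov, v_info)
            for name, (v_lit, cov, v_info, _reported) in table.items()}
```

The recomputed values agree with the tabled correlations to about four decimal places. Because the MCEMGROUP variances are smaller while the covariance is nearly the same as for the other two programs, its implied correlation is the largest of the three.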
Figure 5 shows the regression parameter estimates as generated by CGROUP and MCEMGROUP plotted against the estimates generated by the gold standard for this data example, BGROUP.
[Figure 5 panels: All Gamma LITERARY and All Gamma INFORMATION, each shown for BGROUP versus MCEMGROUP and for BGROUP versus CGROUP.]
Figure 5. Regression coefficients for the 2003 NAEP reading assessment at grade 4.
The figure shows that the 688 regression coefficient estimates each for the literary and the information scales agree to a large extent between the different versions of MGROUP. Almost all points are found on the main diagonal, which indicates that the Γ̂ estimates produced by BGROUP, CGROUP, and MCEMGROUP are virtually identical.
6.2 Subgroup Estimates for the 2003 NAEP Reading Assessment at Grade 4
Figure 6 presents plots of the posterior means as generated by CGROUP and MCEMGROUP compared to the posterior means generated by BGROUP. Figure 7 compares the corresponding posterior standard deviations.
[Figure 6 panels, for LITERARY and Information: posterior mean, BGROUP versus MCEMGROUP; BGROUP versus CGROUP; MCEMGROUP versus CGROUP.]
Figure 6. Posterior means as generated by CGROUP and MCEMGROUP and plotted against the reference, BGROUP, for the NAEP 2003 grade 4 reading data.
In contrast to the first example (that involving simulated data), there is no obvious large difference between the estimates of the posterior means or standard deviations (SDs) generated by CGROUP and MCEMGROUP. Both approaches differ slightly from the estimates generated by BGROUP. CGROUP overestimates the posterior standard deviation for larger values when compared to BGROUP. The graph comparing BGROUP and CGROUP posterior SDs shows this effect in a very similar way to what Thomas (1993) reports in his Figure 1.
MCEMGROUP, on the other hand, stays closer to BGROUP for large posterior standard deviations than CGROUP does. For small standard deviations, MCEMGROUP seems to produce slightly larger values than BGROUP, while the estimates generated by CGROUP and BGROUP agree better with smaller standard deviations.
Except for a few outliers, the posterior means as generated by MCEMGROUP agree nicely with the values produced by both BGROUP and CGROUP. There is no obvious indication of systematic over- or underestimation for the vast majority of subjects. The obvious outliers seen in these plots are subject to ongoing research aimed at optimizing and stabilizing the current implementation of MCEMGROUP. It will be investigated whether the stochastic integration failed to produce meaningful results because of poor coverage of the true posterior by the importance sampling distribution, because of the small size of the importance sample, or for other less obvious reasons.
[Figure 7 panels, for LITERARY and Information: posterior SD, BGROUP versus MCEMGROUP; BGROUP versus CGROUP; MCEMGROUP versus CGROUP.]
Figure 7. Posterior standard deviations as generated by CGROUP and MCEMGROUP and plotted against the reference, BGROUP, for the 2003 NAEP reading assessment at grade 4 data.
Table 4 shows the subgroup estimates and the corresponding standard deviations (in parentheses) provided by the three methods.
There are hardly any differences between CGROUP and MCEMGROUP as far as subgroup estimates are concerned. However, when compared to the BGROUP results, CGROUP slightly underestimates the subgroup means on both scales, and MCEMGROUP does so on the literary scale; on the information scale, the MCEMGROUP means are within 0.003 of the BGROUP means.
Table 4.
Comparison of Subgroup Estimates From BGROUP, CGROUP, and MCEMGROUP

          BGROUP                        CGROUP                        MCEMGROUP
Pop. sub.          Lit.          Info.           Lit.          Info.           Lit.          Info.
Overall    0.025 (0.95)  –0.003 (0.98)   0.020 (0.96)  –0.008 (0.99)   0.020 (0.94)  –0.003 (0.98)
Male      –0.093 (0.96)  –0.065 (1.00)  –0.100 (0.96)  –0.072 (1.01)  –0.099 (0.94)  –0.066 (0.99)
Female     0.147 (0.94)   0.060 (0.97)   0.143 (0.94)   0.057 (0.98)   0.142 (0.93)   0.061 (0.96)
White      0.277 (0.87)   0.278 (0.89)   0.273 (0.87)   0.277 (0.90)   0.272 (0.86)   0.279 (0.89)
Black     –0.478 (0.92)  –0.549 (0.92)  –0.487 (0.92)  –0.560 (0.93)  –0.482 (0.91)  –0.547 (0.91)
Hisp.     –0.402 (0.94)  –0.493 (0.94)  –0.409 (0.94)  –0.503 (0.95)  –0.407 (0.93)  –0.492 (0.93)
Asian      0.213 (0.93)   0.187 (0.97)   0.210 (0.94)   0.186 (0.98)   0.208 (0.92)   0.188 (0.96)
AmInd     –0.391 (0.94)  –0.400 (0.94)  –0.401 (0.95)  –0.408 (0.95)  –0.397 (0.92)  –0.400 (0.93)
Public    –0.019 (0.96)  –0.048 (0.99)  –0.025 (0.96)  –0.053 (1.00)  –0.025 (0.94)  –0.047 (0.98)
Private    0.439 (0.83)   0.389 (0.87)   0.435 (0.83)   0.386 (0.88)   0.433 (0.82)   0.386 (0.86)

Note. Pop. sub. = Population subgroup, Lit. = Literary, Info. = Information, Hisp. = Hispanic, Asian = Asian Pacific, AmInd = American Indian.
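The size and direction of the deviations from BGROUP can be read off Table 4 with a few lines of code. The sketch below copies the subgroup means from the table (standard deviations omitted) and computes the CGROUP and MCEMGROUP means minus the BGROUP means; the names are hypothetical.

```python
# Subgroup means copied from Table 4, as (BGROUP, CGROUP, MCEMGROUP).
LITERARY = {
    "Overall": (0.025, 0.020, 0.020),
    "Male": (-0.093, -0.100, -0.099),
    "Female": (0.147, 0.143, 0.142),
    "White": (0.277, 0.273, 0.272),
    "Black": (-0.478, -0.487, -0.482),
    "Hisp.": (-0.402, -0.409, -0.407),
    "Asian": (0.213, 0.210, 0.208),
    "AmInd": (-0.391, -0.401, -0.397),
    "Public": (-0.019, -0.025, -0.025),
    "Private": (0.439, 0.435, 0.433),
}
INFORMATION = {
    "Overall": (-0.003, -0.008, -0.003),
    "Male": (-0.065, -0.072, -0.066),
    "Female": (0.060, 0.057, 0.061),
    "White": (0.278, 0.277, 0.279),
    "Black": (-0.549, -0.560, -0.547),
    "Hisp.": (-0.493, -0.503, -0.492),
    "Asian": (0.187, 0.186, 0.188),
    "AmInd": (-0.400, -0.408, -0.400),
    "Public": (-0.048, -0.053, -0.047),
    "Private": (0.389, 0.386, 0.386),
}


def deviations_from_bgroup(scale):
    """Return (CGROUP - BGROUP, MCEMGROUP - BGROUP) differences per subgroup."""
    cgroup = [c - b for b, c, _m in scale.values()]
    mcem = [m - b for b, _c, m in scale.values()]
    return cgroup, mcem
```

With these inputs, every CGROUP mean lies below the BGROUP mean on both scales (by at most 0.011), every MCEMGROUP literary mean lies below BGROUP (by at most 0.006), and the MCEMGROUP information means differ from BGROUP by no more than 0.003 in either direction.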
7 Conclusions and Future Work
CGROUP is the current operational method used in large scale assessments such as NAEP. Though CGROUP provides more accurate results than its predecessor (NGROUP), it is not without problems, as demonstrated by Thomas (1993); especially, CGROUP is found to inflate variance estimates for examinees with large posterior variances. As of now, there is no entirely satisfactory alternative to CGROUP. As this work shows, a stochastic EM method using importance sampling provides a viable alternative to CGROUP. Application of the importance sampling EM method to a simulated data set and a real data set reveals interesting facts.
The regression parameter estimates are extremely close for BGROUP (the gold standard), CGROUP (the current operational version of MGROUP used in NAEP, etc.), and the newly proposed MCEMGROUP. The estimates of conditional posterior means seem to be quite close for CGROUP and MCEMGROUP. In the real data example, both of these sets of estimates are very close to the BGROUP estimates. The MCEM algorithm produced a few outlying estimated conditional posterior means, especially in the real data example, that may be due to Monte Carlo error; this issue needs further investigation. Statements about agreement between posterior standard deviations cannot be clearly made as of now. For the real data example, which has a large sample size, CGROUP provides inflated estimates of posterior SDs for examinees with large SDs, while MCEMGROUP does not. However, MCEMGROUP results in a few outliers in the middle of the range of the SDs. Interestingly, the pattern is quite different for the simulated data, where the MCEMGROUP estimates of posterior SDs are bigger than those from CGROUP. Overall, the stochastic EM method performs slightly better than the CGROUP version of MGROUP.
One problem with the stochastic EM method is that it is very time-consuming. For example, for our real data example (2003 reading data), BGROUP takes 8 hours and CGROUP takes 4 hours on a 2.2 GHz Pentium IV computer, whereas MCEMGROUP takes about 48 hours, which is longer than CGROUP by a factor of 12. However, with the advent of increasingly fast computers, application of MCEMGROUP will become more feasible; even now, the method can be used to obtain a second opinion.
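It is possible to improve the efficiency of the importance sampling EM method even further. Let us consider estimation of E(θ|X,Y,Γt,Σt), which is given in (15). Let θ0 denote the mode of p(θ|X,Y,Γt,Σt). We have

$$
\begin{aligned}
E(\theta \mid X, Y, \Gamma_t, \Sigma_t)
&= \int \theta\, p(\theta \mid X, Y, \Gamma_t, \Sigma_t)\, d\theta \\
&= \int \theta \exp\{u(\theta)\}\, d\theta, \quad \text{for } \exp\{u(\theta)\} \equiv p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \\
&= \int \theta \exp\left\{ u(\theta_0) + \tfrac{1}{2}(\theta - \theta_0)' u''(\theta_0)(\theta - \theta_0) + \Delta(\theta) \right\} d\theta, \quad \text{as } u'(\theta_0) = 0 \\
&= e^{u(\theta_0)} \int \theta \exp\left\{ \tfrac{1}{2}(\theta - \theta_0)' u''(\theta_0)(\theta - \theta_0) \right\} \exp(\Delta(\theta))\, d\theta,
\end{aligned}
$$

where

$$\Delta(\theta) = u(\theta) - u(\theta_0) - \tfrac{1}{2}(\theta - \theta_0)' u''(\theta_0)(\theta - \theta_0).$$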
The first term under exponentiation in the last step above forms a normal distribution, and ∆(θ) is much less variable (and close to zero) over the range of θ than is [...] should lead to more stable results, and the importance sample size needed to reach the same level of accuracy and precision will be greatly reduced.
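The decomposition above can be sketched for a one-dimensional case. The following is an illustration under stated assumptions, not the proposed MGROUP extension: it assumes a three-item Rasch likelihood with made-up difficulties and responses and a standard normal prior, draws from the normal density defined by the mode θ0 and curvature u''(θ0), and uses exp(∆(θ)) as the importance weight. The weights are self-normalized, so the constant exp{u(θ0)} and the normalizing constant of the posterior cancel. All names are hypothetical. A small number of items is used on purpose: with a normal sampling density the weights can become heavy-tailed when the likelihood is more informative, which is one reason a t density may be preferred.

```python
import math
import random

RESPONSES = [1, 0, 1]            # hypothetical item responses
DIFFICULTIES = [-0.5, 0.0, 0.5]  # hypothetical Rasch difficulties
PRIOR_MEAN, PRIOR_SD = 0.0, 1.0


def item_probs(theta):
    """Rasch probabilities of a correct response for each item."""
    return [1.0 / (1.0 + math.exp(-(theta - b))) for b in DIFFICULTIES]


def u(theta):
    """Log of the unnormalized posterior density of theta."""
    loglik = sum(math.log(p) if x == 1 else math.log(1.0 - p)
                 for x, p in zip(RESPONSES, item_probs(theta)))
    return loglik - 0.5 * ((theta - PRIOR_MEAN) / PRIOR_SD) ** 2


def u_prime(theta):
    """First derivative of u."""
    return (sum(x - p for x, p in zip(RESPONSES, item_probs(theta)))
            - (theta - PRIOR_MEAN) / PRIOR_SD ** 2)


def u_double_prime(theta):
    """Second derivative of u (always negative here)."""
    return -sum(p * (1.0 - p) for p in item_probs(theta)) - 1.0 / PRIOR_SD ** 2


def find_mode(start=0.0, tol=1e-12, max_iter=50):
    """Newton iterations for the posterior mode theta_0."""
    theta = start
    for _ in range(max_iter):
        step = u_prime(theta) / u_double_prime(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta


def delta(theta, mode):
    """Remainder of the second-order expansion of u around the mode."""
    return u(theta) - u(mode) - 0.5 * u_double_prime(mode) * (theta - mode) ** 2


def posterior_mean_is(n_draws, rng):
    """Self-normalized importance sampling estimate of E(theta | data)."""
    mode = find_mode()
    sd = math.sqrt(-1.0 / u_double_prime(mode))
    num = den = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(mode, sd)         # draw from the normal part
        weight = math.exp(delta(theta, mode))  # weight by exp(Delta)
        num += weight * theta
        den += weight
    return num / den


def posterior_mean_grid(lo=-8.0, hi=8.0, n_points=4001):
    """Brute-force quadrature on an equally spaced grid, for comparison."""
    step = (hi - lo) / (n_points - 1)
    num = den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        dens = math.exp(u(theta))
        num += theta * dens
        den += dens
    return num / den
```

For example, `posterior_mean_is(6000, random.Random(0))` can be compared with `posterior_mean_grid()`; because ∆(θ) is zero at the mode and varies slowly near it, the weights stay close to 1 for most draws.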
There are a number of issues that need to be addressed before the method is implemented in practice. First, a study of the Monte Carlo error in estimating the conditional posterior means and variances is needed. Second, we would like to study the effect of different importance sample sizes on the parameter estimates obtained, mainly to find the optimum sample size required to obtain reasonable accuracy in the estimation process. Third, the issue of convergence of the EM algorithm used in MCEMGROUP needs to be explored further. Fourth, the optimal choice of the degrees of freedom of the t importance sampling density is an open question and needs further investigation. Finally, it will be of interest to use an adaptive importance sample size, as in Booth and Hobert (1999), who start with a small sample size but gradually increase it.
References
Beaton, A. (1987). The NAEP 1983-84 technical report. Princeton, NJ: ETS.
Bauer, H. (1972). Probability theory and elements of measurement theory. New York: Holt,
Rinehart, and Winston.
Beaton, A., & Zwick, R. (1992). Overview of the National Assessment of Educational
Progress. Journal of Educational and Behavioral Statistics, 17, 95-109.
Booth, J. G., & Hobert, J. P. (1999). Maximizing generalized linear mixed model
likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal
Statistical Society, Series B, 61(1), 265-285.
Broniatowski, M., Celeux, G., & Diebolt, J. (1983). Reconnaissance de melanges de
densites par un algorithme d’apprentissage probabiliste. Data Analysis and
Informatics, 3, 359-373.
Celeux, G., & Diebolt, J. (1985). The SEM algorithm: A probabilistic teacher algorithm
derived from the EM algorithm for the mixture problem. Computational Statistics, 2,
73-82.
Clarkson, D. B., & Gonzalez, R. (2001). Random effects diagonal metric multidimensional
scaling models. Psychometrika, 66(1), 25-43.
Cochran, W. G. (1977). Sampling techniques. New York: John Wiley & Sons.
Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across
normal populations. Journal of the American Statistical Association, 94, 1035-1044.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B,
39, 1-38.
von Davier, M. (2003). Comparing conditional and marginal direct estimation of subgroup
distributions (ETS RR-03-02). Princeton, NJ: ETS.
von Davier, M., & Yu, H. T. (2003). Recovery of population characteristics from sparse
matrix samples of simulated item responses. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
von Davier, M., & Yon, H. (2004). A conditioning model with relaxed assumptions. Paper
Education, San Diego, CA.
Fox, J. P. (2003). Stochastic EM for estimating the parameters of multilevel IRT model. British Journal of Mathematical and Statistical Psychology, 56, 65-81.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis.
New York: Chapman & Hall.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo
integration. Econometrica, 57, 1317-1339.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in
practice. London: Chapman & Hall.
Greene, W. H. (2002). Econometric analysis (4th ed.). Upper Saddle River, NJ: Prentice
Hall.
Johnson, M. S. (2002). A Bayesian hierarchical model for multidimensional performance
assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Johnson, M. S., & Jenkins, F. (in press). A Bayesian hierarchical model for large-scale
educational surveys: An application to the National Assessment of Educational Progress. Princeton, NJ: ETS.
Kass, R., & Steffey, D. (1989). Approximate Bayesian inference in conditionally
independent hierarchical models. Journal of the American Statistical Association, 84,
717-726.
Kirsch, I. (2001). The International Adult Literacy Survey (IALS): Understanding what
was measured (ETS RR-01-25). Princeton, NJ: ETS.
Lee, S. Y., Song, X. Y., & Lee, J. C. K. (2003). Maximum likelihood estimation of
nonlinear structural equation models with ignorable missing data. Journal of
Educational and Behavioral Statistics, 28, 111-124.
Martin, M. O., & Kelly, D. L. (1996). TIMSS Technical Report: Vol. I. Design and
development. Chestnut Hill, MA: Boston College.
McCulloch, C. E. (1994). Maximum likelihood variance components estimation for binary
data. Journal of the American Statistical Association, 89, 330-335.
Meng, X. L., & Schilling, S. (1996). Fitting full-information item factor models and an
empirical investigation of bridge sampling. Journal of the American Statistical Association, 91, 1254-1267.
Mislevy, R. (1984). Estimating latent distributions. Psychometrika, 44, 358-381.
Mislevy, R., Johnson, E., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of
Educational and Behavioral Statistics, 17, 131-154.
Mislevy, R. (1985). Estimation of latent group effects. Journal of the American Statistical
Association, 80, 993-997.
Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., & Kennedy, A. M. (2003). PIRLS 2001
international report: IEA’s study of reading literacy achievement in primary schools. Chestnut Hill, MA: Boston College.
Thomas, N. (1993). Asymptotic corrections for multivariate posterior moments with
factored likelihood functions. Journal of Computational and Graphical Statistics,
2(3), 309-322.
Wei, G. C. G., & Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm
and the poor man’s data augmentation algorithms. Journal of the American
Statistical Association, 85, 699-704.
Zellner, A. (1962). An efficient method for estimating seemingly unrelated regressions and
Notes