
Application of the Stochastic EM Method to Latent Regression Models



Matthias von Davier and Sandip Sinharay ETS, Princeton, NJ


ETS Research Reports provide preliminary and limited dissemination of ETS research prior to publication. To obtain a PDF or a print copy of a report, please visit:


Abstract

The reporting methods used in large-scale assessments such as the National Assessment of Educational Progress (NAEP) rely on a latent regression model. The first component of the model consists of a p-scale IRT measurement model that defines the response probabilities on a set of cognitive items in p scales depending on a p-dimensional latent trait variable θ = (θ1, ..., θp). In the second component, the conditional distribution of this latent trait variable θ is modeled by a multivariate, multiple regression on a set of predictor variables, which are usually based on student, school, and teacher variables in assessments such as NAEP.

In order to fit the latent regression model using the maximum (marginal) likelihood estimation technique, multivariate integrals have to be evaluated. In the computer program MGROUP used by ETS for fitting the latent regression model to data from NAEP and other sources, the integration is currently done either by numerical quadrature (for problems up to two dimensions) or by an approximation of the integral. CGROUP, the current operational version of the MGROUP program used in NAEP and other assessments since 1993, is based on Laplace approximation that may not provide fully satisfactory results, especially if the number of items per scale is small.

This paper examines the application of stochastic expectation-maximization (EM) methods (where an integral is approximated by an average over a random sample) to NAEP-like settings. We present a comparison of CGROUP with a promising implementation of the stochastic EM algorithm that utilizes importance sampling. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting multivariate latent regression models.


Acknowledgements

The authors thank John Mazzeo, Shelby Haberman, Alina von Davier, Neal Thomas, Andreas Oranje, Ying Jin, and Matthew Johnson for useful advice, Steve Isham for help with the data sets used in the analysis, and Kim Fryer for help with proofreading.


1 Introduction

The National Assessment of Educational Progress (NAEP), the only regularly administered and congressionally mandated national assessment program (see, e.g., Beaton & Zwick, 1992), is an ongoing survey of the academic achievement of school students in the United States in a number of subject areas such as reading, writing, and mathematics. It is administered by the National Center for Education Statistics (NCES), a part of the U.S. Department of Education, to a selected sample of students in Grades 4, 8, and 12. A document called the Nation’s Report Card reports the results of NAEP for a number of academic subjects on a regular basis. A comparison is provided to the previous assessments in each subject area. Academic achievement is described as average student proficiency for all students in the U.S. and as the percentage of students attaining fixed levels of proficiency (the levels being defined by the National Assessment Governing Board as what students should know at each grade level) in different subjects. In addition to producing these numbers for the nation as a whole, NAEP reports the same results for different subpopulations (based on gender, race, school type, etc.) of the student population.

NAEP is prohibited by law from reporting results for individual students, schools, or school districts and is designed to obtain optimal estimates of subpopulation characteristics rather than those of individual performance. To assess national performance in a valid way, NAEP must sample a wide and diverse body of student knowledge. To avoid the burden involved with presenting each item to every examinee, NAEP selects students randomly from designated grade and age populations (first, a sample of schools is selected according to a detailed stratified sampling plan, as mentioned in, e.g., Beaton & Zwick, 1992, and then students are sampled within the schools). NAEP then administers one of many possible booklets of items to each student. This process is sometimes referred to as “matrix sampling of items.” For example, the 2000 NAEP mathematics assessment at Grade 4 contained 173 items split across 26 booklets. Each item was developed to measure one of five subscales: (a) Number and Operations, (b) Measurements, (c) Geometry, (d) Data Analysis, and (e) Algebra. An item can be multiple-choice or constructed-response. Background (demographic) information is collected on the students through questionnaires that are filled out by students, teachers, and school administrators. For example, the questionnaires collected information on 381 background variables for each student in the assessment. The above description clearly shows that NAEP’s design and implementation are fundamentally different from those of a large-scale testing program.

NAEP reports were originally envisaged as simple lists of percents correct to individual survey items (Mislevy, Johnson, & Muraki, 1992), in the population as a whole and in subpopulations of particular interest. However, it was realized later that this approach has severe limitations (e.g., comparison is limited to groups of items common to student subpopulations), and major features of the detailed results from hundreds of items and hundreds of background variables could not be effectively communicated without some kind of statistical modeling. As a result, starting in 1984, NAEP reporting methods used a statistical model consisting of two components: (a) an item response theory (IRT) model, and (b) a linear regression model (see, e.g., Beaton, 1987; Mislevy et al., 1992). Other large scale educational assessments such as the International Adult Literacy Study (IALS; Kirsch, 2001), Trends in Mathematics and Science Study (TIMSS; Martin & Kelly, 1996), and Progress in International Reading Literacy Study (PIRLS; Mullis, Martin, Gonzalez, & Kennedy, 2003) also adopted essentially the same model.

This model is referred to as either the conditioning model, multilevel IRT model, or latent regression model. An algorithm for estimating the parameters of this model is implemented in the MGROUP set of programs, an ETS product. The first component of the model (the IRT part) defines the responses of examinees to a set of cognitive items to be dependent on a p-dimensional latent trait/proficiency vector θ = (θ1, ..., θp). In the second component (linear regression part), the distribution of the proficiency variable θ is modeled by a multivariate, multiple regression on a set of background/predictor variables, which contain information on the respective students, schools, teachers, etc.

In order to compute maximum (marginal) likelihood estimates of the parameters of this latent regression model, the proficiency θ (usually multivariate) has to be integrated out; this makes estimation for the model problematic. Mislevy (1984, 1985) shows that the maximum likelihood estimates can be obtained using an expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). The algorithm requires the values of the posterior means and the posterior variances for the examinee proficiency parameters. For problems up to two dimensions, the integration is computed using numerical quadrature by the BGROUP version (Beaton, 1987) of the MGROUP program in operational settings in NAEP. For higher dimensions, no numerical integration routine is available operationally (although it may now be possible to perform numerical integration for higher dimensions; work on that is in progress), and an approximation of the integral is used. The CGROUP version of MGROUP, the current operational procedure used in NAEP and other assessments since 1993, is based on the Laplace approximation (Kass & Steffey, 1989) that ignores the higher order derivatives of the examinee posterior distribution and may not provide accurate results, especially in higher dimensions (e.g., a graphical plot for a data example in Thomas, 1993, shows that CGROUP overestimates the high examinee posterior variances).

A number of extensions to the CGROUP version of MGROUP or proposals for alternative estimation methods have been suggested. The MCMC estimation approach of Johnson and Jenkins (in press) and Johnson (2002) focuses on implementing a fully Bayesian method for joint estimation of IRT and regression parameters. The direct estimation for normal populations (Cohen & Jiang, 1999) tries to replace the multiple regression by a model that is essentially equivalent to a multigroup IRT model with some additional restrictions on item parameters and group-specific variances. The YGROUP version (von Davier & Yu, 2003) of MGROUP implements seemingly unrelated regressions (SUR; Zellner, 1962) and offers generalized least squares estimation of the latent regression model (von Davier & Yon, 2004). None of these alternatives has been entirely satisfactory. The MCMC estimation approach could not be implemented for the operational model with multidimensional proficiencies and the full set of background variables because of the time-consuming nature of MCMC estimation. The direct estimation approach as proposed by Cohen and Jiang (1999) leads to biased estimates even in very small data examples with nonnormal marginal ability distributions (von Davier, 2003). The SUR approach to estimating multivariate regressions is identical to generalized least squares (GLS) if all equations have the same regressors and observed dependent variables (Greene, 2002), a result that cannot be expected to hold in latent regressions. Therefore, there is scope for further research in this area.

Stochastic EM methods (SEM; Broniatowski, Celeux, & Diebolt, 1983; Celeux & Diebolt, 1985) have been suggested for use in EM algorithms where the E-step is hard to compute or approximate analytically or numerically. In an application of a stochastic EM algorithm, the E-step is handled by simulation. Depending on the nature of the simulation used in the E-step, a stochastic EM method can be a Monte Carlo EM (e.g., Wei & Tanner, 1990), Markov chain Monte Carlo EM (e.g., McCulloch, 1994), rejection sampling EM (e.g., Booth & Hobert, 1999), importance sampling EM (e.g., Booth & Hobert, 1999), etc. Important applications of the algorithm in psychometrics include Fox (2003), Lee, Song, and Lee (2003), Clarkson and Gonzalez (2001), and Meng and Schilling (1996), each of which employs the Markov chain Monte Carlo algorithm (e.g., Gilks, Richardson, & Spiegelhalter, 1996) to simulate in the E-step.

The importance sampling EM (e.g., Booth & Hobert, 1999) is a stochastic EM method where each integral in the E-step is approximated by an average over a random sample.

This paper examines the prospect of application of the importance sampling EM method to fit the latent regression model in NAEP-like settings. We also present a comparison of the results from the importance sampling EM algorithm with those from the CGROUP version of MGROUP, which is the technique currently used in NAEP. For a low dimensional real data example, the results from the suggested method are also compared to the BGROUP version of MGROUP, which performs exact numerical integration and is the gold standard method in such settings. Simulation studies and real data analysis show that the stochastic EM method provides a viable alternative to CGROUP for fitting latent regression models.

2 NAEP Model, Estimation, and the Current MGROUP Method

This section first states the two component statistical model used in NAEP. A brief description of the EM algorithm, which will be helpful later, follows. The next subsection describes the current NAEP estimation process and the MGROUP program used in ETS.


2.0.1 The Latent Regression Model

NAEP and other similar large-scale assessments implement a latent regression model utilizing an IRT measurement model. Assume that the unique p-dimensional latent proficiency vector for examinee i is θi = (θi1, θi2, ..., θip)′. In NAEP, p could be 2, 3, or 5.

Let us denote the response vector to the test items for examinee i as yi = (yi1, yi2, ..., yip), where yik, a vector of responses, contributes information about θik. The likelihood for an examinee is given by

\[
f(y_i \mid \theta_i) = \prod_{q=1}^{p} f_1(y_{iq} \mid \theta_{iq}) \equiv L(\theta_i; y_i). \tag{1}
\]

The terms f1(yiq|θiq) above follow a univariate IRT model, usually with three-parameter logistic (3PL) or generalized partial-credit model (GPCM) likelihood terms. For reasons to be discussed later, the dependence of (1) on the item parameters is suppressed.

Suppose xi = (xi1, xi2, ..., xim) are m fully measured demographic and educational characteristics for the examinee. Conditional on xi, the examinee proficiency vector θi is assumed to follow a multivariate normal prior distribution, that is, θi|xi ∼ N(Γ′xi, Σ). The mean parameter matrix Γ and the nonnegative definite variance matrix Σ are assumed to be the same for all examinee groups.

Under this setup, L(Γ, Σ|Y, X), the (marginal) likelihood function for (Γ, Σ) based on the data (X, Y), is given by

\[
L(\Gamma, \Sigma \mid Y, X) = \prod_{i=1}^{n} \int f_1(y_{i1} \mid \theta_{i1}) \cdots f_1(y_{ip} \mid \theta_{ip}) \, \phi(\theta_i \mid \Gamma' x_i, \Sigma) \, d\theta_i, \tag{2}
\]

where n is the number of examinees, and φ(·|·, ·) is the multivariate normal density function.
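To make the integral in (2) concrete, the following Python sketch evaluates one examinee's contribution for a single scale (p = 1) by Gauss-Hermite quadrature. It is an illustration under stated assumptions, not the MGROUP code: the function names, the item parameters, the responses, and the prior mean 0.3 (standing in for Γ′xi) are all hypothetical, only 3PL items are used, and no scaling constant is applied in the logistic function.

```python
import numpy as np


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (no scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def marginal_likelihood(y, a, b, c, mean, sd, n_nodes=41):
    """Integrate one examinee's 3PL likelihood against a N(mean, sd^2) prior."""
    # Gauss-Hermite nodes and weights for the weight function exp(-t^2).
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    theta = mean + np.sqrt(2.0) * sd * nodes            # change of variable
    prob = p_3pl(theta[:, None], a, b, c)               # shape (nodes, items)
    like = np.prod(np.where(y == 1, prob, 1.0 - prob), axis=1)
    return float(np.sum(weights * like) / np.sqrt(np.pi))


# Hypothetical item parameters and responses for a five-item scale.
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.full(5, 0.2)
y = np.array([1, 1, 0, 1, 0])

# In the model the prior mean would be Gamma' x_i; 0.3 is a made-up value.
lik = marginal_likelihood(y, a, b, c, mean=0.3, sd=1.0)
```

The product of such terms over examinees gives (2). For p scales the same idea needs a p-dimensional grid, which is why quadrature becomes expensive as p grows.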

2.0.2 EM Algorithm

The EM algorithm (e.g., Dempster et al., 1977) is commonly used to obtain maximum likelihood estimates or posterior modes in problems with missing data. In fact, the algorithm is applicable in a broad range of situations where probability models can be reexpressed as ones on augmented parameter spaces and the added parameters can be thought of as missing data. The EM algorithm is relevant for mixed models because the random effects may be viewed as missing data and the algorithm may be used to find maximum likelihood estimates for these models.

Suppose that given an unknown parameter vector ω, data y follows a distribution f(y; ω) and that

\[
y' = (y'_{\mathrm{obs}}, y'_{\mathrm{mis}}),
\]

where yobs is the observed data and ymis is the missing data. Suppose the objective is to obtain the maximum likelihood estimate of ω, i.e., the value of ω that maximizes the observed data likelihood

\[
f(y_{\mathrm{obs}} \mid \omega) = \int f(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \omega) \, dy_{\mathrm{mis}}.
\]

The EM algorithm goes through a succession of E-steps and M-steps starting from an initial value ω(0) of the parameter vector ω. In the (t + 1)-th E-step, one computes

\[
Q(\omega \mid \omega^{(t)}) = E\left[\log\{f(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \omega)\} \mid y_{\mathrm{obs}}, \omega^{(t)}\right], \tag{3}
\]

the expected value of the complete data loglikelihood, where the expectation is with respect to the distribution of ymis given yobs and ω(t). In the corresponding M-step, one maximizes Q(ω|ω(t)) computed in the preceding E-step with respect to ω to get ω(t+1). The algorithm is said to have converged when two successive iterations give very similar values of the parameter vector. It is also important to monitor the value of the loglikelihood log f(yobs|ω).

The EM algorithm is guaranteed to converge only to a local maximum, so it is customary to start the iterations at many points in the parameter space and then to compare the values of the likelihood at all of the local maxima. The algorithm may be very slow to converge if the proportion of missing data is high.

For simple problems, one may be able to do the averaging in (3) analytically. However, in many practical problems (e.g., in NAEP estimation), the expectation cannot be computed analytically; then one may need to use an approximation technique or a simulation technique to compute the expectation.
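As an example of a simple problem in which the averaging in (3) is analytic, consider a toy model that is not part of the report: yi = θi + ei with θi ∼ N(μ, σ²) treated as missing data and ei ∼ N(0, s²) with s² known. The Python sketch below (function and variable names are illustrative) alternates the closed-form E-step and M-step; in this toy model the iterations should approach the sample mean of y and the sample variance of y minus s².

```python
import random
import statistics


def em_normal_normal(y, s2, mu=0.0, sigma2=1.0, n_iter=200):
    """EM for y_i = theta_i + e_i, theta_i ~ N(mu, sigma2), e_i ~ N(0, s2)."""
    for _ in range(n_iter):
        # E-step: posterior variance and posterior means of the missing theta_i.
        v = 1.0 / (1.0 / sigma2 + 1.0 / s2)
        m = [v * (yi / s2 + mu / sigma2) for yi in y]
        # M-step: maximize the expected complete-data loglikelihood.
        mu = statistics.fmean(m)
        sigma2 = v + statistics.fmean([(mi - mu) ** 2 for mi in m])
    return mu, sigma2


# Simulated data with made-up true values mu = 1, sigma2 = 4, s2 = 1.
rng = random.Random(1)
y = [rng.gauss(1.0, 2.0) + rng.gauss(0.0, 1.0) for _ in range(2000)]
mu_hat, sigma2_hat = em_normal_normal(y, s2=1.0)
```

The M-step for σ² adds the posterior variance v to the spread of the posterior means, which is the same structure that appears when the E-step can no longer be done in closed form and has to be approximated.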

2.0.3 NAEP Estimation Process and the MGROUP Program

NAEP uses a three-stage estimation process for fitting the above-mentioned latent regression model. The first stage, scaling, fits an IRT model (consisting of 3PL and GPCM terms) of the form in (1) to the examinee response data and estimates the item parameters. The prior distribution used in this step is not θi|xi ∼ N(Γ′xi, Σ) as described above, but is θi ∼ N(0, I), that is, the subscales are assumed to be independent a priori. The second stage, conditioning, assumes that the item parameters are fixed at the estimates found in scaling and fits the model in (2) to the data, that is, estimates Γ and Σ as a first part. In the second part of the conditioning step, plausible values for all examinees are obtained using the parameter estimates obtained in scaling and the first part of conditioning; the plausible values are used to estimate examinee subgroup averages. The third stage of the NAEP estimation process, called variance estimation, estimates the variances corresponding to the examinee subgroup averages using a jackknife approach (see, e.g., Johnson & Jenkins, in press). Our research will focus on the conditioning step and assume that the scaling has already been done (i.e., the item parameters are fixed); this is the reason we suppress the dependence of (1) on the item parameters.

Because we will be concerned with the conditioning step, the remaining part of the section provides a more detailed discussion of it. The first objective of this step is to estimate Γ and Σ from the data. If the θi's were known, the maximum likelihood estimators of Γ and Σ would be

\[
\hat{\Gamma} = (X'X)^{-1} X' \begin{pmatrix} \theta_1' \\ \theta_2' \\ \vdots \\ \theta_n' \end{pmatrix}, \tag{4}
\]

\[
\hat{\Sigma} = \frac{1}{n} \sum_{i} (\theta_i - \Gamma' x_i)(\theta_i - \Gamma' x_i)'. \tag{5}
\]

However, the θi's are actually unknown. Mislevy (1984, 1985) shows that the maximum likelihood estimates of Γ and Σ under unknown θi's can be obtained using an EM algorithm (Dempster et al., 1977). The EM algorithm iterates through a number of expectation steps (E-steps) and maximization steps (M-steps). The updated value of the parameters in the tth M-step is obtained as:

\[
\Gamma_{t+1} = (X'X)^{-1} X' \begin{pmatrix} \tilde{\theta}_{1t}' \\ \tilde{\theta}_{2t}' \\ \vdots \\ \tilde{\theta}_{nt}' \end{pmatrix}, \tag{6}
\]

\[
\Sigma_{t+1} = \frac{1}{n} \left[ \sum_{i} \mathrm{Var}(\theta_i \mid X, Y, \Gamma_t, \Sigma_t) + \sum_{i} (\tilde{\theta}_{it} - \Gamma_{t+1}' x_i)(\tilde{\theta}_{it} - \Gamma_{t+1}' x_i)' \right], \tag{7}
\]

where $\tilde{\theta}_{it} = E(\theta_i \mid X, Y, \Gamma_t, \Sigma_t)$ is the posterior mean for examinee i given the preliminary parameter estimates of iteration t. The process is repeated until convergence of the estimates Γ and Σ.

Formulae (6) and (7) require the values of the posterior means E(θi|X, Y, Γt, Σt) and the posterior variances Var(θi|X, Y, Γt, Σt) for the examinees. Correspondingly, the tth E-step computes the two required quantities for all the examinees. The MGROUP set of programs at ETS performs the EM algorithm mentioned above.
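The M-step updates in (6) and (7) amount to a least-squares regression of the posterior means on the background variables, followed by an average of posterior covariances plus residual cross-products. The following NumPy sketch shows one way to write this; the function name `m_step`, the array layout, and the example inputs are assumptions made for illustration and do not describe the MGROUP source code.

```python
import numpy as np


def m_step(X, post_means, post_covs):
    """Update (Gamma, Sigma) from posterior moments, following (6) and (7).

    X          : (n, m) background variables, one row per examinee
    post_means : (n, p) posterior means of theta_i from the E-step
    post_covs  : (n, p, p) posterior covariance matrices of theta_i
    """
    n = X.shape[0]
    # (6): least-squares regression of the posterior means on X.
    gamma = np.linalg.solve(X.T @ X, X.T @ post_means)       # shape (m, p)
    # (7): average posterior covariance plus residual cross-products.
    resid = post_means - X @ gamma
    sigma = (post_covs.sum(axis=0) + resid.T @ resid) / n
    return gamma, sigma


# Made-up inputs: posterior means that lie exactly on a regression surface
# and a common posterior covariance of 0.5 * I for every examinee.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
gamma_true = np.array([[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]])
post_means = X @ gamma_true
post_covs = np.tile(0.5 * np.eye(2), (50, 1, 1))

gamma_new, sigma_new = m_step(X, post_means, post_covs)
```

With these inputs the residuals vanish, so the update returns the regression matrix used to build the means and a Σ equal to the common posterior covariance. The versions of MGROUP differ only in how `post_means` and `post_covs` are obtained.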

The MGROUP program consists of two primary controlling routines called PHASE1 and PHASE2. The former does some preliminary processing while the latter directs the EM iterations. There are different versions of the MGROUP program depending on the method used to perform the E-step in PHASE2: BGROUP using numerical quadrature, NGROUP (Mislevy, 1985) using Bayesian normal theory, CGROUP (Thomas, 1993) using Laplace approximations, YGROUP (von Davier & Yu, 2003) using SUR, and so on.

The BGROUP version of MGROUP, applied when the dimension of θi is less than or equal to two, applies numerical quadrature to approximate the integral. When the dimension of θi is larger than two, NGROUP, CGROUP, or YGROUP may be used. Note that NGROUP and YGROUP are not used for operational purposes for the following reasons: NGROUP assumes normality of both the likelihood and the prior and may produce biased results if these assumptions are not met (Thomas, 1993). YGROUP performs the estimation independently by implementing seemingly unrelated regressions (Zellner, 1962), which is useful for generating starting values, but does not capitalize on the covariances between the p components of the latent trait. This may be inappropriate, for example, if [...] or all subjects.

The following subsection provides details about CGROUP, the approach currently used operationally in several large scale assessments including NAEP. This approach is based on approximation of posterior moments of a multivariate distribution using Laplace approximation.

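CGROUP E-step. The CGROUP version of MGROUP uses the Laplace approximation of the posterior mean and variance, that is, uses

\[
E(\theta_j \mid X, Y, \Gamma_t, \Sigma_t) \approx \hat{\theta}_{j,\mathrm{mode}} - \frac{1}{2} \sum_{r=1}^{p} G_{jr} G_{rr} \hat{h}_r^{(3)}, \quad j = 1, 2, \ldots, p, \tag{8}
\]

\[
\begin{aligned}
\mathrm{Cov}(\theta_j, \theta_k \mid X, Y, \Gamma_t, \Sigma_t) \approx{} & G_{jk} - \frac{1}{2} \sum_{r=1}^{p} (G_{jr} G_{kr} G_{rr}) \, \hat{h}_r^{(4)} \\
& + \frac{1}{2} \sum_{r=1}^{p} \sum_{s=1}^{r} \left[ 1 - \frac{1}{2} I(r = s) \right] \hat{h}_r^{(3)} \hat{h}_s^{(3)} G_{rs} \\
& \quad \times \left( G_{js} G_{ks} G_{rr} + G_{js} G_{kr} G_{rs} + G_{jr} G_{ks} G_{rs} + G_{jr} G_{kr} G_{ss} \right), \quad j, k = 1, 2, \ldots, p,
\end{aligned}
\tag{9}
\]

where

\[
\hat{\theta}_{j,\mathrm{mode}} = j\text{th component of the posterior mode of } \theta,
\]

\[
h = -\log\left[ f(y \mid \theta) \, \phi(\theta \mid \Gamma_t' x, \Sigma_t) \right], \qquad
G_{jk} = \left[ \left( \left. \frac{\partial^2 h}{\partial \theta_r \, \partial \theta_s} \right|_{\hat{\theta}_{\mathrm{mode}}} \right)^{-1} \right]_{jk}, \tag{10}
\]

\[
\hat{h}_r^{(n)} = n\text{th pure partial derivative of } h \text{ with respect to } \theta_r, \text{ evaluated at } \hat{\theta}_{\mathrm{mode}},
\]

and I(r = s) is 1 if r = s and 0 otherwise.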
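As a check on how the mean and covariance approximations (8) and (9) behave, the following worked reduction (an illustration, not taken from the report) specializes them to a single scale, p = 1. Here G denotes the inverse of the second derivative of h at the mode, and h is assumed to be the negative log of the unnormalized posterior, so that G is positive.

```latex
\[
E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \approx \hat{\theta}_{\mathrm{mode}} - \frac{1}{2} G^{2} \hat{h}^{(3)},
\]
\[
\mathrm{Var}(\theta \mid X, Y, \Gamma_t, \Sigma_t) \approx G - \frac{1}{2} G^{3} \hat{h}^{(4)} + G^{4} \left( \hat{h}^{(3)} \right)^{2}.
\]
```

The last term comes from the double sum in (9): with r = s = 1 the bracket equals 1/2 and each of the four products equals G³, giving (1/2)(1/2)(ĥ(3))² G · 4G³. For an exactly normal posterior the third and fourth derivatives of h vanish, and the approximation returns the mode and G unchanged.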

Details about the formulae (8) and (9) can be found in Thomas (1993, pp. 316-317).¹ The Laplace method does not provide an unbiased estimate of the quantity it is approximating and may provide inaccurate results if higher order derivatives (that the Laplace method assumes to be equal to zero) are not negligible. The error of approximation for each component of the mean and covariance of θi is of order O(1/k²) (e.g., Kass & Steffey, 1989), where k is the number of items measuring the skill corresponding to the component. Because the number of items measuring each skill in NAEP is not too large (making k rather small), the errors in the Laplace approximation may become nonnegligible, especially for high-dimensional θi's. Further, if the posterior distribution of the θi's is multimodal (which is not impossible, especially for a small number of items), the method can perform poorly. Therefore the CGROUP version of MGROUP is not entirely satisfactory. Figure 1 in Thomas (1993), where the posterior variance estimates of 500 randomly selected examinees using the Laplace method and exact numerical integration for two-dimensional θi are plotted, shows that the Laplace method provides inflated variance estimates for examinees with large posterior variance (see also Section 6 in this report). The departure may be more severe for θi's in higher dimensions.

3 The Suggested Stochastic EM Method

Before going into the details of the suggested technique, we provide a brief description of the importance sampling method (e.g., Geweke, 1989).

3.1 Importance Sampling

The Radon-Nikodym theorem (e.g., Bauer, 1972) forms the basis for importance sampling. This theorem states that a measure M over Ω can be expressed in terms of another measure N over Ω,

\[
M(A) = \int_A f \, dN,
\]
\[
\int_A m(\omega) \, d\omega = \int_A n(\omega) f(\omega) \, d\omega,
\]

if N(A) = 0 ⇒ M(A) = 0 for all A. The function f is referred to as the Radon-Nikodym derivative and is sometimes written as f = ∂M/∂N. In the case that m, n are probability densities over Ω = R^p, the theorem implies

\[
\int_\Omega f(\omega) \, \omega \, n(\omega) \, d\omega = \int_\Omega \omega \, m(\omega) \, d\omega,
\]

which is equivalent to E_M(ω) = E_N(fω). This result is applied in importance sampling to approximate intractable integrals.

Suppose we want to compute the expected value of a function g(ω) where the expectation is taken with respect to a probability density f(ω). We can express the expectation of g(ω) as

\[
E_f\{g(\omega)\} = \int_\omega g(\omega) f(\omega) \, d\omega = \int_\omega \frac{g(\omega) f(\omega)}{h(\omega)} \, h(\omega) \, d\omega \tag{11}
\]

for any probability density h(ω) defined on the same sample space that is zero only if f(ω) is zero. Now, if we can generate a random sample ω̃1, ω̃2, ..., ω̃n from the distribution with density h(ω), we can approximate Ef{g(ω)}, using (11), as

\[
E_f\{g(\omega)\} \approx \frac{1}{n} \sum_{i=1}^{n} \frac{g(\tilde{\omega}_i) f(\tilde{\omega}_i)}{h(\tilde{\omega}_i)}. \tag{12}
\]

The standard error corresponding to the above estimate is given by dividing the standard deviation of the importance ratios g(ω̃i)f(ω̃i)/h(ω̃i) by √n. The standard deviation mentioned above determines the quality of the approximation. The density h(ω) is called the importance sampling density.
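A small numerical illustration of the estimate in (12) and its standard error follows. The choices here are assumptions made only for the example: f is the standard normal density, g(ω) = ω² (so the true expectation is 1), and h is a Student t density with four degrees of freedom. The helper names are hypothetical.

```python
import math
import random
import statistics


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1.0 + x * x / df) ** (-(df + 1) / 2)


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def draw_t(rng, df):
    """One Student t draw: a standard normal divided by sqrt(chi-square / df)."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


def importance_estimate(g, f, h_pdf, h_draw, n, rng):
    """Return the average of the importance ratios and its standard error."""
    ratios = []
    for _ in range(n):
        w = h_draw(rng)
        ratios.append(g(w) * f(w) / h_pdf(w))
    return statistics.fmean(ratios), statistics.stdev(ratios) / math.sqrt(n)


rng = random.Random(2024)
estimate, std_error = importance_estimate(
    g=lambda w: w * w,              # E_f{g} is the variance of f, that is, 1
    f=normal_pdf,
    h_pdf=lambda w: t_pdf(w, 4),
    h_draw=lambda r: draw_t(r, 4),
    n=20000,
    rng=rng,
)
```

Because the t density has heavier tails than the normal density, the ratios stay bounded and the standard error is small; reversing the roles of the two densities would make a few ratios very large.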

If h(ω) is chosen so that the importance ratio g(ω)f(ω)/h(ω) is roughly constant over the range of possible values of ω (which will make the standard deviation of the importance ratios small), a fairly accurate approximation of the integral may be obtained. In particular, h(ω) should have heavier tails than the product g(ω)f(ω) since otherwise there may be a few ω̃i's for which h(ω) will be much smaller than g(ω)f(ω) and the estimate will blow up. The Radon-Nikodym theorem requires ∫_A h = 0 ⇒ ∫_A g = 0 to hold, which can be stated in an approximate fashion as: Wherever h is zero, g needs to be zero. This is the theoretical justification for the heavier tail requirement of h over gf.

The main advantages of importance sampling are that it provides an unbiased estimate of the integral and the accuracy can be monitored by examining the standard deviation of the g(ω̃i)f(ω̃i)/h(ω̃i) values.

In some situations, the researcher knows a distribution up to a constant only, but still needs to compute the moments of the distribution. For example, often all that is available in a Bayesian analysis is a multiple of the posterior (obtained by multiplying the likelihood and the prior), but not the posterior itself. Importance sampling can be used in this situation as well. Suppose, in the above setup, one does not know f(ω), but does know q(ω) ≡ c·f(ω). The expectation of a function g(ω) is given by

\[
E_f\{g(\omega)\} = \frac{\int_\omega g(\omega) q(\omega) \, d\omega}{\int_\omega q(\omega) \, d\omega}.
\]

The numerator in the right-hand side of the above is approximated by

\[
\int_\omega g(\omega) q(\omega) \, d\omega = \int_\omega \frac{g(\omega) q(\omega)}{h(\omega)} \, h(\omega) \, d\omega \approx \frac{1}{n} \sum_{i=1}^{n} \frac{g(\tilde{\omega}_i) q(\tilde{\omega}_i)}{h(\tilde{\omega}_i)},
\]

while the denominator is approximated by $\frac{1}{n} \sum_{i=1}^{n} q(\tilde{\omega}_i) / h(\tilde{\omega}_i)$ (see, e.g., Gelman, Carlin, Stern, & Rubin, 2003, pp. 342-343). Geweke (1989) shows that the estimate above converges almost surely to the estimand under weak assumptions and that a central limit theorem (establishing that $n^{1/2}(\hat{E}_f\{g(\omega)\} - E_f\{g(\omega)\}) \to N(0, \tau^2)$, where τ² can be estimated consistently) applies under stronger assumptions. The estimation of the ratio of two quantities by the ratio of their estimates is known as ratio estimation. For more details about ratio estimation, see, for example, Cochran (1977).
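The ratio estimate described above can be sketched in a few lines of Python. In this illustration (all names and numerical choices are hypothetical), q is a normal density with mean 1 and variance 1 multiplied by an arbitrary constant, and the importance sampling density is a shifted and scaled t density with four degrees of freedom; the constant cancels in the ratio, so the estimate should be close to 1.

```python
import math
import random


def self_normalized_mean(g, q, h_pdf, h_draw, n, rng):
    """Ratio estimate of E_f{g} when only q = c * f is available."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        w = h_draw(rng)
        weight = q(w) / h_pdf(w)
        num += g(w) * weight
        den += weight
    return num / den          # the 1/n factors cancel in the ratio


def t4_pdf(x, loc, scale):
    """Student t density with 4 degrees of freedom, shifted and scaled."""
    z = (x - loc) / scale
    return 0.375 * (1.0 + z * z / 4.0) ** -2.5 / scale


def t4_draw(rng, loc, scale):
    """One draw from the shifted and scaled t density with 4 df."""
    chi2 = rng.gammavariate(2.0, 2.0)          # chi-square with 4 df
    return loc + scale * rng.gauss(0.0, 1.0) / math.sqrt(chi2 / 4.0)


def q(w):
    """Unnormalized target: N(1, 1) kernel times an arbitrary constant."""
    return 7.3 * math.exp(-0.5 * (w - 1.0) ** 2)


rng = random.Random(7)
post_mean = self_normalized_mean(
    g=lambda w: w,
    q=q,
    h_pdf=lambda w: t4_pdf(w, 0.5, 1.5),
    h_draw=lambda r: t4_draw(r, 0.5, 1.5),
    n=20000,
    rng=rng,
)
```

Unlike the estimate based on a fully known density, a ratio estimate of this kind is not exactly unbiased for finite n, although the bias vanishes as n grows.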

3.2 Importance Sampling in the E-step of the MGROUP Program

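This paper suggests approximating the posterior expectation and variance of the examinee proficiencies θi in (6) and (7) using the importance sampling method.

The posterior distribution of θi, denoted as p(θi|X, Y, Γt, Σt), is given by

\[
p(\theta_i \mid X, Y, \Gamma_t, \Sigma_t) \propto f_1(y_{i1} \mid \theta_{i1}) \cdots f_1(y_{ip} \mid \theta_{ip}) \, \phi(\theta_i \mid \Gamma_t' x_i, \Sigma_t) \tag{13}
\]

using (2). The proportionality constant in (13) is a function of yi, Γt, and Σt. Let us denote

\[
q(\theta_i \mid X, Y, \Gamma_t, \Sigma_t) \equiv f_1(y_{i1} \mid \theta_{i1}) \cdots f_1(y_{ip} \mid \theta_{ip}) \, \phi(\theta_i \mid \Gamma_t' x_i, \Sigma_t). \tag{14}
\]

We drop the subscript i for convenience for the rest of the section and let θ denote the proficiency of an examinee.

We have to compute the mean and variance of p(θ|X, Y, Γt, Σt), that is, the quantities

\[
E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \equiv \int \theta \, p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \, d\theta \tag{15}
\]

and

\[
\mathrm{Var}(\theta \mid X, Y, \Gamma_t, \Sigma_t) \equiv \int \left( \theta - E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \right) \left( \theta - E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \right)' p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \, d\theta. \tag{16}
\]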

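Using ideas from the above description of the importance sampling method, if we can generate a random sample θ1, θ2, ..., θn from a distribution h(θ) approximating p(θ|X, Y, Γt, Σt) reasonably, we can approximate (15) by the ratio of

\[
\frac{1}{n} \sum_{j=1}^{n} \frac{\theta_j \, q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)}
\quad \text{and} \quad
\frac{1}{n} \sum_{j=1}^{n} \frac{q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)}.
\]

Similarly, we can approximate (16) as the ratio of

\[
\frac{1}{n} \sum_{j=1}^{n} \frac{\left( \theta_j - E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \right) \left( \theta_j - E(\theta \mid X, Y, \Gamma_t, \Sigma_t) \right)' q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)}
\quad \text{and} \quad
\frac{1}{n} \sum_{j=1}^{n} \frac{q(\theta_j \mid X, Y, \Gamma_t, \Sigma_t)}{h(\theta_j)},
\]

with E(θ|X, Y, Γt, Σt) replaced by its importance sampling approximation.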

Following Booth and Hobert (1999), who apply the importance sampling EM method for fitting generalized linear mixed models, we use a multivariate t importance sampling density. Booth and Hobert (1999) argue that, in general, a multivariate t density is a better choice as an importance sampling density than a normal density. One reason is that the former is heavy-tailed while the latter is not, and a good importance sampling density should be heavy-tailed when the density it approximates may be so; also, the t density is easy to sample from. The degrees of freedom of the t density are taken to be four because low degrees of freedom ensure that a t density has a heavy tail. (The choice of optimal degrees of freedom is an issue requiring further investigation.)

The mean and variance of the t importance sampling density are taken as input from a nonstochastic version of MGROUP. In our implementation, the YGROUP version (von Davier & Yu, 2003) is used, which conveniently provides input for the importance sampling density in the stochastic EM algorithm.
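The E-step computation for one examinee can be sketched as follows in NumPy. This is a minimal illustration under stated assumptions rather than the MCEMGROUP implementation: the function name and arguments are hypothetical, `log_q` stands for the log of the unnormalized posterior in (14) (here replaced by a Gaussian kernel with known moments so that the result can be checked), and `center` and `scale` play the role of the mean and variance supplied by a nonstochastic program such as YGROUP.

```python
import numpy as np


def is_posterior_moments(log_q, center, scale, n_draws=6000, df=4, seed=0):
    """Posterior mean and covariance by self-normalized importance sampling.

    log_q  : function mapping an (n, p) array of theta values to log q(theta)
    center : (p,) location of the multivariate t importance density
    scale  : (p, p) scale matrix of the multivariate t importance density
    """
    rng = np.random.default_rng(seed)
    p = center.shape[0]
    chol = np.linalg.cholesky(scale)
    # Multivariate t draws: normal vector divided by sqrt(chi-square / df).
    z = rng.standard_normal((n_draws, p))
    u = rng.chisquare(df, size=n_draws)
    theta = center + (z @ chol.T) / np.sqrt(u / df)[:, None]
    # Log t density up to a constant; constants cancel in the ratio.
    maha = np.sum(np.linalg.solve(chol, (theta - center).T) ** 2, axis=0)
    log_h = -0.5 * (df + p) * np.log1p(maha / df)
    log_w = log_q(theta) - log_h
    w = np.exp(log_w - log_w.max())            # numerically stabilized weights
    w /= w.sum()
    mean = w @ theta
    dev = theta - mean
    cov = (w[:, None] * dev).T @ dev
    return mean, cov


# Stand-in target with known moments (an assumption for checking only).
target_mean = np.array([0.5, -0.3])
target_cov = np.array([[1.0, 0.5], [0.5, 1.0]])
precision = np.linalg.inv(target_cov)


def log_q(theta):
    d = theta - target_mean
    return -0.5 * np.einsum("ij,jk,ik->i", d, precision, d)


mean, cov = is_posterior_moments(
    log_q, center=np.array([0.4, -0.2]), scale=1.2 * target_cov
)
```

In an application to the latent regression model, `log_q` would instead sum the item loglikelihood terms of the p scales and the log of the normal prior density with mean Γt′xi and variance Σt, and the returned moments would feed the M-step.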

It is also possible to estimate variances of the parameter estimates obtained from the EM algorithm as suggested by, for example, Booth and Hobert (1999). However, we are more interested in the estimation of the latent regression parameters in this work and do not delve into the issue here.

Note that a number of studies have successfully applied other forms of stochastic EM methods, such as rejection sampling EM (Booth & Hobert, 1999) and Markov chain Monte Carlo EM, or MCMC-EM, (e.g., McCulloch, 1994), but we do not explore those approaches, mainly because of the difficulties in their application to our problem.

4 MCEM in MGROUP

This section describes the implementation of MCEMGROUP (MCEM is short for Monte Carlo EM), the version of MGROUP that uses importance sampling to generate the necessary statistics in the E-step. Because of the presence of a stochastic component in the E-step, this technique is also called a stochastic EM algorithm.

Following the notes on the implementation, differences between assessing convergence in nonstochastic optimization and in stochastic optimization algorithms will be discussed.

4.1 Implementing MCEMGROUP

The MCEM algorithm was integrated into YGROUP (von Davier, 2003), which is based on BGROUP and implements SUR (Zellner, 1962) and various other modifications.

Following Booth and Hobert (1999), importance sampling is used in the E-step to generate the statistics (here, posterior means, residual components, and measurement error components) necessary to perform the EM iterations for the latent regression.

The implementation uses importance sampling, employing a multivariate t (as in Booth & Hobert, 1999). The MCEMGROUP version as implemented in YGROUP uses the SUR posterior means and the posterior residual (co-)variance matrix as fast and efficient means to compute the mean and variance of the importance sampling density.

As in all versions of MGROUP, optional starting values for the latent regression (Γ, Σ) can be provided but are not essential. As an alternative, initial iterations with a nonstochastic E-step of one of the other versions can be carried out to generate starting values. The initial sample size of the importance sample can be chosen to be much smaller than the final sample size, in order to speed up initial iterations where individual accuracy is less important than approaching a good starting value for the aggregate parameters. The examples below use 500 as initial importance sample size and 6,000 as final importance sample size.

4.2 Determining Convergence of the MCEM Algorithm

When applying a stochastic EM algorithm, it is possible that the EM algorithm has not converged, but appears so after an EM step (i.e., updated parameters and likelihoods do not change much from the previous step) because of an unlucky random sample.

One should therefore use a stricter convergence criterion, or a larger variety of criteria that all have to be met simultaneously, in order to ensure the convergence of the algorithm. We monitor the likelihood function and the parameter vector for convergence and conclude that convergence has occurred only when the relative change in both these quantities is less than a preset threshold in five successive iterations.

The likelihood increases over the iterations in a nonstochastic EM algorithm (see, e.g., Dempster et al., 1977) while it may not be so for a stochastic EM algorithm because of Monte Carlo error. In order to assess convergence in stochastic optimization algorithms, the inevitable nonmonotonicity of these algorithms has to be taken into account. Each E-step in our implementation employs a different set of points (the importance sample) in the multivariate quadrature grid in order to perform the integration. This means that the likelihood of each observed response vector may vary, and hence, the likelihood of the whole sample may vary even if all other parameters are unchanged, from one importance sample to the next.

Therefore, in our implementation, we monitor absolute changes in the log likelihood, the regression parameters Γ, and the residual (co-)variance matrix Σ simultaneously. The maximum absolute changes are checked against user-defined stopping criteria, which are used to decide whether a significant change has occurred that demands further iterations towards the optimum.

The algorithm stops if, and only if, all three absolute changes are smaller than user-defined numbers. In addition to that, the absolute changes of previous


iterations are integrated with the current change by using the following approach:

For cycle $t$, define the average absolute maximum change ($AAMC$) as

$$AAMC_{x,t} = p \times AMC_{x,t} + (1-p) \times AAMC_{x,t-1},$$

which averages the current absolute maximum change ($AMC$) and the previous average absolute maximum change ($AAMC$).

The $x$ denotes a parameter (vector) or a function of parameters (for example, the maximum change in regression parameters in our case); $t$ denotes the current iteration (cycle) of the algorithm; $t-1$, the previous cycle; and $0 \le (1-p) \le 1.0$, the weight of the averaged criterion from the previous cycles.

This criterion ensures that, if p < 1, more than the current change (which might

be small due to the stochastic nature of the algorithm) is taken into account when deciding

whether to stop iterations. If p = 1, we have a regular (no memory) stopping criterion,

which stops whenever the current iteration alone meets the stopping rule.

In the current implementation, the stopping rule fades out the past absolute

changes rather slowly. We found that p = 0.4 and the AAMC bound of 0.045 for relative

change in likelihood, regression weights, and variances works reasonably well. This choice balances infinite iterations against premature termination of the algorithm in the examples

presented here. The rule for stopping the algorithm is that $AAMC_{x,t} < 0.045$ has to be satisfied for termination, a maximum change bound similar to what Booth and Hobert (1999) report.
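The stopping rule can be sketched in a few lines. The code below is an illustration under stated assumptions, not the MCEMGROUP implementation: the function names and the sequence of changes are hypothetical, and the first AAMC value is initialised with the first observed AMC because the report does not say how the recursion starts.

```python
def update_aamc(previous_aamc, current_amc, p=0.4):
    """One step of AAMC_t = p * AMC_t + (1 - p) * AAMC_{t-1}."""
    return p * current_amc + (1.0 - p) * previous_aamc

def first_stopping_cycle(amc_history, bound=0.045, p=0.4):
    """Return the first 1-based cycle at which every monitored AAMC is below the bound."""
    aamc = None
    for cycle, amc in enumerate(amc_history, start=1):
        if aamc is None:
            aamc = dict(amc)  # assumption: start the average at the first observed change
        else:
            aamc = {name: update_aamc(aamc[name], amc[name], p) for name in amc}
        if all(value < bound for value in aamc.values()):
            return cycle
    return None  # bound never reached within the recorded cycles

# Hypothetical absolute maximum changes; cycle 2 is small only by chance.
loglik_changes = [0.50, 0.01, 0.20, 0.03, 0.02, 0.005, 0.005, 0.005]
history = [{"loglik": c, "gamma": 0.5 * c, "sigma": 0.4 * c} for c in loglik_changes]

print("memoryless rule (p = 1.0) stops at cycle", first_stopping_cycle(history, p=1.0))
print("averaged rule   (p = 0.4) stops at cycle", first_stopping_cycle(history, p=0.4))
```

With p = 1 the rule is fooled by the accidentally small change in the second cycle, whereas the averaged rule keeps iterating until the changes have been small for several cycles in a row.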

5 Analysis of Simulated Data

This section presents the first proof-of-concept example. The simulated data set used for this example matches the structure and size of the 2000 NAEP mathematics assessment at grade 4 and comes from a previous study (von Davier & Yu, 2003). The assessment has five scales: (a) Number and Operations, (b) Measurements, (c) Geometry, (d) Data Analysis, and (e) Algebra. This section applies the stochastic EM method to find the parameter estimates for the simulated data set, which involves responses of 13,511 students. Each student responds to 1 of 26 sets of items (booklets). Each booklet contains three blocks of 12 to 15 items (both multiple-choice and constructed-response) out of a total of 145 items. Each examinee has information on 381 predictors that are used in the latent regression model.

This section compares estimates from the MCEMGROUP and the operational CGROUP of the posterior means and standard deviations of examinees, the regression

parameters Γ, and the residual variance-covariance matrix Σ.

As an additional benchmark, both the results of CGROUP and MCEMGROUP are compared against the generating values, that is, the individual latent trait vectors used to simulate the data.

5.1 Estimates of Γ and Σ for the Simulated Data

Figure 1 shows a comparison of the CGROUP and MCEMGROUP regression parameter estimates (i.e., estimates of the individual parameters in Γ) for the 2000 NAEP mathematics assessment at grade 4 by subscale. The figure shows that the estimates from the two approaches are extremely close.

Table 1 shows the residual variance-covariance estimates (i.e., estimates of the

individual parameters in Σ) from CGROUP and MCEMGROUP for the simulated data

set.

The table shows slightly larger estimated variances for the MCEM version of MGROUP as compared to CGROUP. The pattern is the opposite for the estimated covariances. The differences between the covariance estimates of CGROUP and MCEMGROUP seem smaller than the differences between the variance estimates of the two approaches.
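These statements can be checked directly against the values reported in Table 1. The following sketch only re-arranges the published estimates; the helper name `to_symmetric` and the variable names are hypothetical.

```python
import numpy as np

# Upper triangles (row by row, diagonal included) of the estimates in Table 1.
cgroup_upper = [0.32592, 0.30510, 0.32886, 0.29929, 0.32429,
                0.32481, 0.31857, 0.28613, 0.31267,
                0.37103, 0.30208, 0.33679,
                0.32091, 0.30645,
                0.35941]
mcem_upper = [0.33872, 0.29779, 0.31766, 0.29167, 0.31602,
              0.35134, 0.30435, 0.27761, 0.30007,
              0.3883, 0.29045, 0.31976,
              0.3552, 0.29600,
              0.38106]

def to_symmetric(upper, dim=5):
    """Fill a symmetric matrix from its row-wise upper triangle."""
    mat = np.zeros((dim, dim))
    mat[np.triu_indices(dim)] = upper
    return mat + np.triu(mat, k=1).T

cgroup = to_symmetric(cgroup_upper)
mcem = to_symmetric(mcem_upper)

diff = mcem - cgroup                         # MCEMGROUP minus CGROUP
var_diff = np.diag(diff)                     # differences in residual variances
cov_diff = diff[np.triu_indices(5, k=1)]     # differences in residual covariances

print("variance differences:  ", np.round(var_diff, 5))
print("covariance differences:", np.round(cov_diff, 5))
print("mean |difference|, variances vs. covariances:",
      np.abs(var_diff).mean(), np.abs(cov_diff).mean())
```

All five variance differences are positive, all ten covariance differences are negative, and the covariance differences are smaller in absolute value on average.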

5.2 Subgroup Estimates for the 2000 NAEP Mathematics Assessment at Grade 4

Figures 2 to 4 present recovery plots for CGROUP and MCEMGROUP. The figures also show subplots comparing the individual means and standard deviations from CGROUP and MCEMGROUP. Each subscale has four subplots. The recovery plots are based on individual examinee statistics and hence will show more variance than the population parameters Γ and Σ. In the recovery plots, the generating values are the reference, and the conditional posterior moments (given the responses, the background variables, and the estimates of these parameters used when generating imputations with MGROUP) as generated by CGROUP and MCEMGROUP are plotted against these reference values. For all subscale displays, the ideal recovery or agreement line, respectively, is indicated by the main diagonal given in each plot.

[Figure 1 plots omitted: five panels titled All Gamma NUM&OP, MEASUREMENT, GEOMETRY, DATA ANALYSIS, and ALGEBRA, each plotting CGROUP against MCEMGROUP estimates.]

Figure 1. Regression parameter (Γ) estimates from CGROUP and MCEMGROUP for the data for the 2000 NAEP mathematics assessment at grade 4.

Figures 2 to 4 show no obvious differences between CGROUP and MCEMGROUP when comparing the plots showing recovery of the generating θ values. This observation finds additional support when examining the plot of both posterior mean estimates. The posterior means generated by CGROUP and MCEMGROUP fall closely around the diagonal line, and the plot is much narrower than the “truth” recovery plots. This indicates that CGROUP and MCEMGROUP agree quite well when producing posterior means, which cannot be said for the posterior standard deviations.

[Figure 2 plots omitted: for each of the Num&Op and Measurement subscales, four panels showing the posterior mean against True Theta for CGROUP, the posterior mean against True Theta for MCEMGROUP, MCEMGROUP against CGROUP posterior means, and MCEMGROUP against CGROUP posterior SDs.]

Figure 2. Posterior moments from CGROUP and MCEMGROUP compared against the generating values and the posterior means and standard deviations from CGROUP and MCEMGROUP compared against each other.

[Figure 3 plots omitted: for each of the Geometry and Data Analysis subscales, four panels showing the posterior mean against True Theta for CGROUP, the posterior mean against True Theta for MCEMGROUP, MCEMGROUP against CGROUP posterior means, and MCEMGROUP against CGROUP posterior SDs.]

Figure 3. Posterior moments from CGROUP and MCEMGROUP compared against the generating values and the posterior means and standard deviations from CGROUP and MCEMGROUP compared against each other.

Table 1.

Residual Variance Covariance Estimates From CGROUP and MCEMGROUP

            NUM&OPER   MEASURMT   GEOMETRY   DATA ANL   ALGEBRA

CGROUP estimates
NUM&OPER    0.32592    0.30510    0.32886    0.29929    0.32429
MEASURMT               0.32481    0.31857    0.28613    0.31267
GEOMETRY                          0.37103    0.30208    0.33679
DATA ANL                                     0.32091    0.30645
ALGEBRA                                                 0.35941

MCEMGROUP estimates
NUM&OPER    0.33872    0.29779    0.31766    0.29167    0.31602
MEASURMT               0.35134    0.30435    0.27761    0.30007
GEOMETRY                          0.3883     0.29045    0.31976
DATA ANL                                     0.3552     0.29600
ALGEBRA                                                 0.38106

Note. Based on the operational population model for the 2000 NAEP mathematics assessment at grade 4.

The posterior standard deviations given by MCEMGROUP are considerably larger than for CGROUP for all subscales. It is to be determined whether this is a systematic difference, whether it is due to Monte Carlo inflation of the variance, or whether this depends on some other structural variable such as the correlation between scales or the sample size.

Table 2 compares the subgroup means from CGROUP and MCEMGROUP for relevant subgroups; there seems to be little difference between the two methods in this respect as well.

The results in Table 2 show that MCEMGROUP does not perform much differently from CGROUP when reproducing group-level statistics such as subgroup means and variances.

[Figure 4 plots omitted: for the Algebra subscale, four panels showing the posterior mean against True Theta for CGROUP, the posterior mean against True Theta for MCEMGROUP, MCEMGROUP against CGROUP posterior means, and MCEMGROUP against CGROUP posterior SDs.]

Figure 4. Posterior moments from CGROUP and MCEMGROUP compared against the generating values and the posterior means and standard deviations from CGROUP and MCEMGROUP compared against each other.

6 Analysis of Real NAEP Data: 2003 Reading Assessment at Grade 4

The 2003 NAEP reading assessment at Grade 4 has data from 187,581 examinees in the fourth grade or 9 years of age. Each student answers a literary block (to assess the ability to read for literary experience) and an informative block (to assess the ability to read for information). Each block consists of 10 to 12 items, with a mixture of multiple-choice and constructed-response items. We will refer to the two skills (subscales) measured by the assessment as literary and information. The combined item sample for these two scales has 111 items in total; 688 background variables are used in the latent regression model.

Because the $\theta_i$ are two-dimensional, it is possible to apply BGROUP (which performs exact numerical integration in the E-step) here. BGROUP provides the gold standard, and we will compare the results obtained from the application of the stochastic EM against it. Comparison with results from CGROUP will be provided as well.

Table 2.

Comparison of Subgroup Estimates From CGROUP and MCEMGROUP

                          CGROUP                                 MCEMGROUP
Subgroup    NumOp   Meas    Geom    DA      Alg      NumOp   Meas    Geom    DA      Alg

Overall     0.033   0.006  –0.023   0.047   0.037    0.031   0.009  –0.013   0.060   0.039
Male        0.069   0.083  –0.043   0.061   0.089    0.066   0.086  –0.030   0.075   0.092
Female     –0.004  –0.072  –0.003   0.032  –0.016   –0.006  –0.071   0.004   0.045  –0.016
White       0.272   0.310   0.248   0.361   0.270    0.270   0.311   0.254   0.369   0.269
Black      –0.618  –0.838  –0.748  –0.794  –0.624   –0.617  –0.833  –0.738  –0.777  –0.623
Hispanic   –0.454  –0.566  –0.550  –0.525  –0.420   –0.459  –0.560  –0.527  –0.493  –0.411
Asian       0.605   0.388   0.532   0.369   0.680    0.605   0.392   0.538   0.382   0.688
Amerind    –0.407  –0.287  –0.693  –0.515  –0.429   –0.412  –0.271  –0.630  –0.154   0.014

6.1 Estimates of Γ and Σ for the Real Data

Table 3 shows the residual variance estimates $\hat{\Sigma}$ as generated by BGROUP, MCEMGROUP, and CGROUP for the NAEP data.

Table 3.

Residual Variances, Covariances, and Correlations for the 2003 NAEP Reading Assessment at Grade 4 Data

              BGROUP              MCEMGROUP             CGROUP
         Lit.      Info.      Lit.      Info.      Lit.      Info.

Lit.     0.50601   0.41801    0.48479   0.41284    0.51547   0.41782
Info.    0.82130   0.51193    0.84360   0.49401    0.80241   0.52598

Note. Lit. = Literary, Info. = Information, Residual variances = main diagonals, covariances = upper off-diagonals, and correlations = lower off-diagonals.
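The correlations in the lower off-diagonals of Table 3 follow from the variances and covariances through corr = cov / sqrt(var_lit × var_info). The short check below recomputes them from the published values; the dictionary layout and the function name are hypothetical conveniences.

```python
import math

# Values copied from Table 3 (Lit. variance, Info. variance, covariance, reported correlation).
table3 = {
    "BGROUP":    {"var_lit": 0.50601, "var_info": 0.51193, "cov": 0.41801, "corr": 0.82130},
    "MCEMGROUP": {"var_lit": 0.48479, "var_info": 0.49401, "cov": 0.41284, "corr": 0.84360},
    "CGROUP":    {"var_lit": 0.51547, "var_info": 0.52598, "cov": 0.41782, "corr": 0.80241},
}

def residual_correlation(var_a, var_b, cov):
    """Correlation implied by two variances and their covariance."""
    return cov / math.sqrt(var_a * var_b)

recomputed = {}
for method, row in table3.items():
    recomputed[method] = residual_correlation(row["var_lit"], row["var_info"], row["cov"])
    print(f"{method:10s} reported {row['corr']:.5f}  recomputed {recomputed[method]:.5f}")
```

The recomputed values agree with the reported correlations up to rounding of the published entries.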

The three estimates for the residual correlation and the residual variances are close. Interestingly, the residual variance estimates generated by MCEMGROUP are slightly smaller than the CGROUP estimates for the reading data, whereas the residual variance estimates for the simulated math data were larger for MCEMGROUP than for CGROUP. As a consequence, the correlation is higher for MCEMGROUP than for CGROUP. Whether this is an effect of the sample size or the number of scales needs further investigation.

Figure 5 shows the regression parameter estimates as generated by CGROUP and MCEMGROUP plotted against the estimates generated by the gold standard for this data example, BGROUP.

[Figure 5 plots omitted: four panels titled All Gamma LITERARY and All Gamma INFORMATION, plotting BGROUP against MCEMGROUP estimates and BGROUP against CGROUP estimates.]

Figure 5. Regression coefficients for the 2003 NAEP reading assessment at grade 4.

The figure shows that the 688 regression coefficient estimates each for the literary and the information scales agree to a large extent between the different versions of MGROUP. Almost all points are found on the main diagonal, which indicates that the $\hat{\Gamma}$ estimates produced by BGROUP, CGROUP, and MCEMGROUP are virtually identical.

6.2 Subgroup Estimates for the 2003 NAEP Reading Assessment at Grade 4

Figure 6 presents plots of the posterior means as generated by CGROUP and MCEMGROUP compared to the posterior means generated by BGROUP. Figure 7 compares the corresponding posterior standard deviations.

[Figure 6 plots omitted: for each of the LITERARY and Information scales, three panels of posterior means plotting BGROUP against MCEMGROUP, BGROUP against CGROUP, and MCEMGROUP against CGROUP.]

Figure 6. Posterior means as generated by CGROUP and MCEMGROUP and plotted against the reference, BGROUP, for the NAEP 2003 grade 4 reading data.

In contrast to the first example (that involving simulated data), there is no obvious large difference between the estimates of the posterior means or standard deviations (SDs) generated by CGROUP and MCEMGROUP. Both approaches differ slightly from the estimates generated by BGROUP. CGROUP overestimates the posterior standard deviation for larger values when compared to BGROUP. The graph comparing BGROUP and CGROUP posterior SDs shows this effect in a very similar way to what Thomas (1993) reports in his Figure 1.

MCEMGROUP, on the other hand, stays closer to BGROUP for large posterior standard deviations than CGROUP does. For small standard deviations, MCEMGROUP seems to produce slightly larger values than BGROUP, while the estimates generated by CGROUP and BGROUP agree better for smaller standard deviations.

Except for a few outliers, the posterior means as generated by MCEMGROUP agree closely with the values produced by both BGROUP and CGROUP. There is no obvious indication of systematic over- or underestimation for the vast majority of subjects. The obvious outliers seen in these plots are the subject of ongoing research aimed at optimizing and stabilizing the current implementation of MCEMGROUP. It will be investigated whether the stochastic integration failed to produce meaningful results because of poor coverage of the true posterior by the importance sampling distribution, because of the small size of the importance sample, or for other less obvious reasons.

[Figure 7 plots omitted: for each of the LITERARY and Information scales, three panels of posterior SDs plotting BGROUP against MCEMGROUP, BGROUP against CGROUP, and MCEMGROUP against CGROUP.]

Figure 7. Posterior standard deviations as generated by CGROUP and MCEMGROUP and plotted against the reference, BGROUP, for the 2003 NAEP reading assessment at grade 4 data.

Table 4 shows the subgroup estimates and the corresponding standard deviations (in parentheses) provided by the three methods.

There are hardly any differences between CGROUP and MCEMGROUP as far as subgroup estimates are concerned. However, when compared to BGROUP results, both CGROUP and MCEMGROUP slightly underestimate the subgroup means.

Table 4.

Comparison of Subgroup Estimates From BGROUP, CGROUP, and MCEMGROUP

                      BGROUP                        CGROUP                       MCEMGROUP
Pop. sub.    Lit.           Info.           Lit.           Info.           Lit.           Info.

Overall      0.025 (0.95)  –0.003 (0.98)    0.020 (0.96)  –0.008 (0.99)    0.020 (0.94)  –0.003 (0.98)
Male        –0.093 (0.96)  –0.065 (1.00)   –0.100 (0.96)  –0.072 (1.01)   –0.099 (0.94)  –0.066 (0.99)
Female       0.147 (0.94)   0.060 (0.97)    0.143 (0.94)   0.057 (0.98)    0.142 (0.93)   0.061 (0.96)
White        0.277 (0.87)   0.278 (0.89)    0.273 (0.87)   0.277 (0.90)    0.272 (0.86)   0.279 (0.89)
Black       –0.478 (0.92)  –0.549 (0.92)   –0.487 (0.92)  –0.560 (0.93)   –0.482 (0.91)  –0.547 (0.91)
Hisp.       –0.402 (0.94)  –0.493 (0.94)   –0.409 (0.94)  –0.503 (0.95)   –0.407 (0.93)  –0.492 (0.93)
Asian        0.213 (0.93)   0.187 (0.97)    0.210 (0.94)   0.186 (0.98)    0.208 (0.92)   0.188 (0.96)
AmInd       –0.391 (0.94)  –0.400 (0.94)   –0.401 (0.95)  –0.408 (0.95)   –0.397 (0.92)  –0.400 (0.93)
Public      –0.019 (0.96)  –0.048 (0.99)   –0.025 (0.96)  –0.053 (1.00)   –0.025 (0.94)  –0.047 (0.98)
Private      0.439 (0.83)   0.389 (0.87)    0.435 (0.83)   0.386 (0.88)    0.433 (0.82)   0.386 (0.86)

Note. Pop. sub. = Population subgroup, Lit. = Literary, Info. = Information, Hisp. = Hispanic, Asian = Asian Pacific, AmInd = American Indian.

7 Conclusions and Future Work

CGROUP is the current operational method used in large scale assessments such as NAEP. Though CGROUP provides more accurate results than its predecessor (NGROUP), it is not without problems, as demonstrated by Thomas (1993); in particular, CGROUP is found to inflate variance estimates for examinees with large posterior variances. As of now, there is no entirely satisfactory alternative to CGROUP. As this work shows, a stochastic EM method using importance sampling provides a viable alternative to CGROUP. Application of the importance sampling EM method to a simulated data set and a real data set reveals interesting facts.

The regression parameter estimates are extremely close for BGROUP (the gold standard), CGROUP (the current operational version of MGROUP used in NAEP, etc.), and the newly proposed MCEMGROUP. The estimates of conditional posterior means seem to be quite close for CGROUP and MCEMGROUP. In the real data example, both of these sets of estimates are very close to the BGROUP estimates. The MCEM algorithm produced a few outlying estimated conditional posterior means, especially in the real data example, that may be due to MC error; this issue needs further investigation. Statements about agreement between posterior standard deviations cannot be clearly made as of now. For the real data example, which has a large sample size, CGROUP provides inflated estimates of posterior SDs for examinees with large SDs, while MCEMGROUP does not. However, MCEMGROUP results in a few outliers in the middle of the range of the SDs. Interestingly, the pattern is quite different for the simulated data, where the MCEMGROUP estimates of posterior SDs are bigger than those from CGROUP. Overall, the stochastic EM method performs slightly better than the CGROUP version of MGROUP.

One problem with the stochastic EM method is that it is very time-consuming. For example, for our real data example (2003 reading data), BGROUP takes 8 hours and CGROUP takes 4 hours on a 2.2 GHz Pentium IV computer, whereas MCEMGROUP takes about 48 hours, which is longer than CGROUP by a factor of 12. However, with the advent of increasingly fast computers, application of MCEMGROUP will become more feasible; at the least, this method can be used to get a second opinion right now.

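It is possible to improve the efficiency of the importance sampling EM method even further. Let us consider estimation of $E(\theta \mid X, Y, \Gamma_t, \Sigma_t)$, which is given in (15). Let us denote by $\theta_0$ the mode of $p(\theta \mid X, Y, \Gamma_t, \Sigma_t)$. We have

$$
\begin{aligned}
E(\theta \mid X, Y, \Gamma_t, \Sigma_t)
&\equiv \int \theta \, p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \, d\theta \\
&\equiv \int \theta \exp\{u(\theta)\} \, d\theta, \quad \text{for } \exp\{u(\theta)\} \equiv p(\theta \mid X, Y, \Gamma_t, \Sigma_t) \\
&\equiv \int \theta \exp\left\{ u(\theta_0) + \tfrac{1}{2} (\theta - \theta_0)' u''(\theta_0) (\theta - \theta_0) + \Delta(\theta) \right\} d\theta \quad \text{as } u'(\theta_0) \equiv 0 \\
&\equiv e^{u(\theta_0)} \int \theta \exp\left\{ \tfrac{1}{2} (\theta - \theta_0)' u''(\theta_0) (\theta - \theta_0) \right\} \exp(\Delta(\theta)) \, d\theta,
\end{aligned}
$$

where $\Delta(\theta) = u(\theta) - u(\theta_0) - \tfrac{1}{2} (\theta - \theta_0)' u''(\theta_0) (\theta - \theta_0)$.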

The first term under exponentiation in the last step above forms a normal

distribution and ∆(θ) is much less variable (and close to zero) over the range of θ than is


should lead to more stable results, and the importance sample size needed to reach the same level of accuracy and precision will be greatly reduced.
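One way to read the decomposition above is as an importance sampling scheme: draw θ from the normal density with mean θ0 and variance −1/u''(θ0), and weight each draw by exp(Δ(θ)). The sketch below tries this in one dimension. It is an illustration, not MCEMGROUP code: the three-item Rasch-type likelihood, the N(0, 1) prior, the Newton search for the mode, and all function names are assumptions, and the weights are self-normalised so that u only needs to be known up to an additive constant.

```python
import numpy as np

def u(theta, responses, difficulties):
    """Unnormalised log posterior: Rasch-type log likelihood plus N(0, 1) log prior."""
    theta = np.asarray(theta, dtype=float)
    logits = theta[..., None] - difficulties
    log_lik = np.sum(responses * logits - np.logaddexp(0.0, logits), axis=-1)
    return log_lik - 0.5 * theta**2

def mode_and_curvature(responses, difficulties, n_iter=50):
    """Newton iterations for the mode theta0; also returns u''(theta0)."""
    theta0 = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(theta0 - difficulties)))
        grad = np.sum(responses - prob) - theta0
        hess = -np.sum(prob * (1.0 - prob)) - 1.0
        theta0 -= grad / hess
    prob = 1.0 / (1.0 + np.exp(-(theta0 - difficulties)))
    return theta0, -np.sum(prob * (1.0 - prob)) - 1.0

def posterior_mean(responses, difficulties, n_draws, rng):
    """Self-normalised importance sampling with the normal part as importance density."""
    theta0, hess = mode_and_curvature(responses, difficulties)
    theta = rng.normal(theta0, np.sqrt(-1.0 / hess), size=n_draws)
    # Delta(theta) = u(theta) - u(theta0) - 0.5 * u''(theta0) * (theta - theta0)^2
    delta = (u(theta, responses, difficulties) - u(theta0, responses, difficulties)
             - 0.5 * hess * (theta - theta0) ** 2)
    weights = np.exp(delta)
    return np.sum(weights * theta) / np.sum(weights)

responses = np.array([1, 1, 0])
difficulties = np.array([-0.5, 0.0, 0.5])
estimate = posterior_mean(responses, difficulties, 6000, np.random.default_rng(0))
print("importance sampling estimate of the posterior mean:", estimate)
```

In this toy case the information at the mode is small enough that the weights exp(Δ(θ)) stay well behaved. With more items the normal density can have lighter tails than the posterior, so a heavier-tailed importance density, such as a t density, may be the safer choice.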

There are a number of issues that need to be addressed before the method is implemented in practice. First, a study of the Monte Carlo error in estimating the conditional posterior means and variances is needed. Second, we would like to study the effect of different importance sample sizes on the parameter estimates obtained, mainly to find the optimum sample size required to obtain reasonable accuracy in the estimation process. Third, the issue of convergence of the EM algorithm used in MCEMGROUP needs to be explored further. Fourth, the optimal choice of the degrees of freedom of the t importance sampling density is an open question and needs further investigation. Finally, it will be of interest to use an adaptive importance sample size, as in Booth and Hobert (1999), who start with a small sample size but gradually increase it.


References

Beaton, A. (1987). The NAEP 1983-84 technical report. Princeton, NJ: ETS.

Bauer, H. (1972). Probability theory and elements of measurement theory. New York: Holt, Rinehart, and Winston.

Beaton, A., & Zwick, R. (1992). Overview of the National Assessment of Educational

Progress. Journal of Educational and Behavioral Statistics, 17, 95-109.

Booth, J. G., & Hobert, J. P. (1999). Maximizing generalized linear mixed model

likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal

Statistical Society, Series B, 61(1), 265-285.

Broniatowski, M., Celeux, G., & Diebolt, J. (1983). Reconnaissance de melanges de densites par un algorithme d’apprentissage probabiliste. Data Analysis and Informatics, 3, 359-373.

Celeux, G., & Diebolt, J. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics, 2, 73-82.

Clarkson, D. B., & Gonzalez, R. (2001). Random effects diagonal metric multidimensional

scaling models. Psychometrika, 66(1), 25-43.

Cochran, W. G. (1977). Sampling techniques. New York: John Wiley & Sons.

Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across

normal populations. Journal of the American Statistical Association, 94, 1035-1044.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from

incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B,

39, 1-38.

von Davier, M. (2003). Comparing conditional and marginal direct estimation of subgroup

distributions (ETS RR-03-02). Princeton, NJ: ETS.

von Davier, M., & Yu, H. T. (2003). Recovery of population characteristics from sparse

matrix samples of simulated item responses. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.

von Davier, M., & Yon, H. (2004). A conditioning model with relaxed assumptions. Paper


Education, San Diego, CA.

Fox, J. P. (2003). Stochastic EM for estimating the parameters of multilevel IRT model. British Journal of Mathematical and Statistical Psychology, 56, 65-81.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis.

New York: Chapman & Hall.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57, 1317-1339.

Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in

practice. London: Chapman & Hall.

Greene, W. H. (2002). Econometric analysis (4th ed.). Upper Saddle River, NJ: Prentice Hall.

Johnson, M. S. (2002). A Bayesian hierarchical model for multidimensional performance

assessments. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Johnson, M. S., & Jenkins, F. (in press). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress. Princeton, NJ: ETS.

Kass, R., & Steffey, D. (1989). Approximate Bayesian inference in conditionally

independent hierarchical models. Journal of the American Statistical Association, 84,

717-726.

Kirsch, I. (2001). The International Adult Literacy Survey (IALS): Understanding what was measured (ETS RR-01-25). Princeton, NJ: ETS.

Lee, S. Y., Song, X. Y., & Lee, J. C. K. (2003). Maximum likelihood estimation of nonlinear structural equation models with ignorable missing data. Journal of Educational and Behavioral Statistics, 28, 111-124.

Martin, M. O., & Kelly, D. L. (1996). TIMSS Technical Report: Vol. I. Design and development. Chestnut Hill, MA: Boston College.

McCulloch, C. E. (1994). Maximum likelihood variance components estimation for binary data. Journal of the American Statistical Association, 89, 330-335.

empirical investigation of bridge sampling. Journal of the American Statistical Association, 91, 1254-1267.

Mislevy, R. (1984). Estimating latent distributions. Psychometrika, 44, 358-381.

Mislevy, R., Johnson, E., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of

Educational and Behavioral Statistics, 17, 131-154.

Mislevy, R. (1985). Estimation of latent group effects. Journal of the American Statistical Association, 80, 993-997.

Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., & Kennedy, A. M. (2003). PIRLS 2001

international report: IEA’s study of reading literacy achievement in primary schools. Chestnut Hill, MA: Boston College.

Thomas, N. (1993). Asymptotic corrections for multivariate posterior moments with

factored likelihood functions. Journal of Computational and Graphical Statistics,

2(3), 309-322.

Wei, G. C. G., & Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85, 699-704.

Zellner, A. (1962). An efficient method for estimating seemingly unrelated regressions and
