A Model for a Nested Study Ed Stanek
Introduction
We present notation that can be used to express a simple model for a nested study. The model is based on the assumption that response has been reported for subjects selected via a two stage cluster sample. At the first stage, a simple random sample of clusters (which we will also refer to as primary sampling units (PSUs)) is selected. At the second stage, a simple random sample of subjects (which we will also refer to as second stage sampling units (SSUs)) is selected in each selected cluster. We index the selection of the clusters in the sample by $i=1,\ldots,n$, and the selection of the subjects in a selected cluster by $j=1,\ldots,m$. We assume that there is a large and equal number of SSUs in each PSU, and a large number of PSUs in the population. We define the population, write a model for a random response, and then develop estimates of parameters in the model that minimize the sum of squared error.
Study Population and Parameters
We assume the study population consists of a large number of clusters that are numbered $s=1,\ldots,N$, where each cluster contains a large number of subjects $t=1,\ldots,M$. As an example, the clusters may be hospitals, with the subjects representing patients in a hospital. We represent the potentially observable response of a subject in a cluster as a fixed constant $\mu_{st}$. Using this representation of response, we define parameters for the mean and variance of cluster $s$ as
$$\mu_s = \frac{1}{M}\sum_{t=1}^{M}\mu_{st}$$
and
$$\sigma_s^2 = \left(\frac{M}{M-1}\right)v_s^2 = \frac{1}{M-1}\sum_{t=1}^{M}\left(\mu_{st}-\mu_s\right)^2$$
for $s=1,\ldots,N$. We describe parameters for the population of clusters next. As for the individual clusters, we define the mean and variance as
$$\mu = \frac{1}{N}\sum_{s=1}^{N}\mu_s$$
and
$$\sigma^2 = \left(\frac{N}{N-1}\right)v^2 = \frac{1}{N-1}\sum_{s=1}^{N}\left(\mu_s-\mu\right)^2.$$
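We use these definitions of parameters to define a model for a response in the population. The potentially observable response for a subject is given by
$$\mu_{st} = \mu + \left(\mu_s-\mu\right) + \left(\mu_{st}-\mu_s\right) = \mu + \beta_s + \varepsilon_{st}$$
where $\beta_s = \left(\mu_s-\mu\right)$ represents the deviation of a cluster mean from the overall mean, and $\varepsilon_{st} = \left(\mu_{st}-\mu_s\right)$ represents the deviation of a subject's response from the mean of the subject's cluster.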
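The parameter definitions in this section can be illustrated numerically. The following is a minimal sketch in Python with NumPy, assuming a small hypothetical population of N = 3 clusters with M = 4 subjects each. The array `mu_st` and its values are invented for illustration only. The sketch computes the cluster and population means and variances (with divisors M - 1 and N - 1 for the sigma-squared parameters) and checks that every response decomposes into the overall mean, a cluster deviation, and a subject deviation.

```python
import numpy as np

# Hypothetical population (illustrative values only):
# rows are clusters s = 1..N, columns are subjects t = 1..M.
mu_st = np.array([[10.0, 12.0, 14.0, 16.0],
                  [20.0, 22.0, 21.0, 25.0],
                  [30.0, 28.0, 33.0, 29.0]])
N, M = mu_st.shape

# Cluster-level parameters.
mu_s = mu_st.mean(axis=1)               # cluster means
sigma2_s = mu_st.var(axis=1, ddof=1)    # within-cluster variance, divisor M - 1
v2_s = mu_st.var(axis=1, ddof=0)        # same sum of squares, divisor M

# Population-level parameters.
mu = mu_s.mean()                        # overall mean
sigma2 = mu_s.var(ddof=1)               # variance of cluster means, divisor N - 1
v2 = mu_s.var(ddof=0)                   # same sum of squares, divisor N

# Model terms: mu_st = mu + beta_s + eps_st.
beta_s = mu_s - mu
eps_st = mu_st - mu_s[:, None]
reconstructed = mu + beta_s[:, None] + eps_st

print("cluster means:", mu_s)
print("overall mean:", mu)
print("decomposition holds:", np.allclose(reconstructed, mu_st))
```

By construction, the cluster deviations sum to zero over clusters, and the subject deviations sum to zero within each cluster.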
The Study
The study is conducted by randomly selecting $i=1,\ldots,n$ clusters, and within each selected cluster, randomly selecting $j=1,\ldots,m$ subjects and then observing response. The index $i$ represents the position of a cluster in the sample, while the index $j$ represents the position of a subject in the selected cluster. Since the subject assigned to position $j$ in the $i$th selected cluster is random, we represent response as the random variable
$$Y_{ij} = \sum_{s=1}^{N}\sum_{t=1}^{M}U_{is}\,U_{jt}^{(s)}\,\mu_{st}$$
representing response for the $j$th selected subject in the $i$th selected cluster. Here $U_{is}$ is an indicator random variable that equals one when cluster $s$ is assigned to position $i$ and zero otherwise, and $U_{jt}^{(s)}$ is the corresponding indicator that subject $t$ of cluster $s$ is assigned to position $j$. The response can also be represented via the model:
$$Y_{ij} = \mu + B_i + E_{ij}$$
where $B_i$ is a random variable that represents the difference between the mean of the $i$th selected cluster and the population mean, and $E_{ij}$ is a random variable that represents the difference between the response of the $j$th selected subject in the $i$th selected cluster and the mean of the $i$th selected cluster.
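The indicator representation of the random response can be sketched in code. The example below is a hedged illustration in Python with NumPy, assuming the same kind of small hypothetical population (the values in `mu_st`, the seed, and the sample sizes `n` and `m` are all invented). It builds 0/1 indicator arrays from random permutations, forms the response as the double sum over clusters and subjects, and recovers the random terms of the model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical population (illustrative values only): rows are clusters, columns are subjects.
mu_st = np.array([[10.0, 12.0, 14.0, 16.0],
                  [20.0, 22.0, 21.0, 25.0],
                  [30.0, 28.0, 33.0, 29.0]])
N, M = mu_st.shape
n, m = 2, 3  # hypothetical first and second stage sample sizes

# U[i, s] = 1 when cluster s is assigned to position i, else 0.
cluster_perm = rng.permutation(N)
U = np.zeros((N, N))
U[np.arange(N), cluster_perm] = 1.0

# U_sub[s, j, t] = 1 when subject t of cluster s is assigned to position j, else 0.
subject_perms = [rng.permutation(M) for _ in range(N)]
U_sub = np.zeros((N, M, M))
for s in range(N):
    U_sub[s, np.arange(M), subject_perms[s]] = 1.0

# Y[i, j] = sum over s and t of U[i, s] * U_sub[s, j, t] * mu_st[s, t].
Y = np.einsum("is,sjt,st->ij", U, U_sub, mu_st)

# Random terms of the model Y[i, j] = mu + B[i] + E[i, j].
mu = mu_st.mean()
beta_s = mu_st.mean(axis=1) - mu
B = U @ beta_s               # deviation of the cluster in position i
E = Y - mu - B[:, None]      # deviation of the subject in position j of that cluster

# The realized sample is the first n cluster positions and first m subject positions.
sample = Y[:n, :m]
print(sample)
```

Representing the whole population as a permuted array, and reading the sample off the leading positions, mirrors the description of the study given above.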
Vector and Matrix Representation of the Population and Model
The study consists of selecting a simple random sample of clusters and a sample of SSUs within the selected clusters. Although only a subset of the population will be selected in the study, the entire population can be represented as a vector of random variables. We use this representation, and subsequently describe the subset that will correspond to the realized sample.
Simple random sampling without replacement can be represented using a vector of random variables. The random variables are defined such that they can take on the values corresponding to any permutation of the values in the population. Each potential realization of the random variables is equally likely. With this representation, a sample of size $n$ is the first $n$ random variables in the vector.
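The permutation view of simple random sampling can be checked by enumeration. The following minimal sketch in Python assumes a hypothetical population of four values (the list `values` and the sample size `n` are invented for illustration). It lists every permutation, treats the first `n` positions as the sample, and counts how often each possible sample occurs.

```python
from collections import Counter
from itertools import permutations

import numpy as np

values = [3.0, 5.0, 8.0, 12.0]   # hypothetical population values
N = len(values)
n = 2                            # hypothetical sample size

# All N! orderings of the unit labels; each is an equally likely realization.
perms = list(permutations(range(N)))

# The sample is the set of units occupying the first n positions.
sample_counts = Counter(frozenset(p[:n]) for p in perms)

# Realized value in each position, one row per permutation.
Y = np.array([[values[k] for k in p] for p in perms])
position_means = Y.mean(axis=0)

print("number of distinct samples:", len(sample_counts))
print("times each sample occurs:", set(sample_counts.values()))
print("expected value in each position:", position_means)
```

Every subset of size `n` appears the same number of times among the permutations, so taking the leading positions of a random permutation gives each sample the same probability, and each position has expected value equal to the population mean.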
This representation can be extended to two stage sampling. With two stage sampling, the SSUs in a PSU are represented by a vector of random variables, in addition to the PSUs in the population. We define the population and model for the population using these ideas.
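First, we represent the clusters in a random permutation of clusters via the positions indexed by $i=1,\ldots,N$. Next, we represent a random permutation of SSUs in the $i$th cluster by the $M\times 1$ vector $\mathbf{Y}_i = \left(Y_{i1}\ \ Y_{i2}\ \cdots\ Y_{iM}\right)'$, and the $NM\times 1$ response vector for the population as $\mathbf{Y} = \left(\mathbf{Y}_1'\ \ \mathbf{Y}_2'\ \cdots\ \mathbf{Y}_N'\right)'$. Similarly, we define the $M\times 1$ vector $\mathbf{E}_i = \left(B_i+E_{i1}\ \ B_i+E_{i2}\ \cdots\ B_i+E_{iM}\right)'$ and the $NM\times 1$ vector $\mathbf{E} = \left(\mathbf{E}_1'\ \ \mathbf{E}_2'\ \cdots\ \mathbf{E}_N'\right)'$. Using this notation, we can summarize the model for the population as
$$\mathbf{Y} = \mathbf{X}\alpha + \mathbf{E}$$
where $\mathbf{X} = \mathbf{1}_{NM}$ is an $NM\times 1$ vector of ones and $\alpha = \mu$.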
We evaluate the expected value and the variance of the population vector next. Let us represent expectation with respect to selection of PSUs with the subscript $\xi_1$, and expectation with respect to selection of SSUs with the subscript $\xi_2$. Since $E_{\xi_1}(B_i)=0$ and $E_{\xi_2|\xi_1}(E_{ij})=0$, then $E_{\xi_1\xi_2}(\mathbf{Y}) = \mathbf{X}\mu$. We make use of the conditional expansion of the variance,
$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) = E_{\xi_1}\left[\operatorname{var}_{\xi_2|\xi_1}(\mathbf{Y})\right] + \operatorname{var}_{\xi_1}\left[E_{\xi_2|\xi_1}(\mathbf{Y})\right]$$
to evaluate the variance. In so doing, we take advantage of results developed for single stage sampling given in c01ed11.doc.
In a single stage problem, the variance was derived for a vector of random variables that corresponds to a random permutation of the values in the population. With $N$ population values, the $N\times 1$ permutation vector $\mathbf{Y}$ has variance
$$\operatorname{var}(\mathbf{Y}) = \sigma^2\left(\mathbf{I}_N - \frac{1}{N}\mathbf{J}_N\right).$$
The variance of the vector of the first $n$ elements in this permutation vector is given by the upper $n\times n$ block of this matrix,
$$\operatorname{var}(\mathbf{Y}_{n\times 1}) = \sigma^2\left(\mathbf{I}_n - \frac{1}{N}\mathbf{J}_n\right).$$
We use these results to develop expressions for the variance in the context of two stage sampling.
The Population Variance in Two Stage Sampling
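The single stage variance result that the two stage derivation builds on can be checked by brute force before proceeding. The sketch below, in Python with NumPy, assumes a hypothetical population of four values (the list `values` and the sample size `n` are invented for illustration). It enumerates all equally likely permutations, computes the covariance matrix of the permutation vector over them, and compares it with the closed form, taking sigma squared with divisor N - 1 and J as a matrix of ones.

```python
from itertools import permutations

import numpy as np

values = [3.0, 5.0, 8.0, 12.0]   # hypothetical population values
N = len(values)
n = 2                            # hypothetical sample size

# Each row is one equally likely realization of the permutation vector.
Y = np.array(list(permutations(values)))

# Covariance over all realizations (bias=True divides by the number of realizations).
cov_enumerated = np.cov(Y, rowvar=False, bias=True)

# Closed form: sigma^2 * (I_N - J_N / N), with sigma^2 using divisor N - 1.
sigma2 = np.var(values, ddof=1)
cov_formula = sigma2 * (np.eye(N) - np.ones((N, N)) / N)

# Variance of the first n positions: the upper n-by-n block.
cov_sample_formula = sigma2 * (np.eye(n) - np.ones((n, n)) / N)

print(np.allclose(cov_enumerated, cov_formula))
print(np.allclose(cov_enumerated[:n, :n], cov_sample_formula))
```

Enumeration is only practical for very small populations, but it gives an exact check rather than a simulation-based approximation.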
We first evaluate expressions for the mean and variance of permutations of second stage units (SSUs) given a specified PSU. Let the $M\times 1$ vector $\mathbf{Y}_i$ represent a random permutation of SSUs for a given PSU corresponding to the $i$th PSU in a random permutation of PSUs. Then $E_{\xi_2|\xi_1}(\mathbf{Y}_i) = m_i\mathbf{1}_M$ and
$$\operatorname{var}_{\xi_2|\xi_1}(\mathbf{Y}_i) = s_i^2\left(\mathbf{I}_M - \frac{1}{M}\mathbf{J}_M\right).$$
Note that in these expressions, $m_i$ is the realization of the random variable $M_i = \sum_{s=1}^{N}U_{is}\mu_s$ and $s_i^2$ is the realization of the random variable $S_i^2 = \sum_{s=1}^{N}U_{is}\sigma_s^2$.
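Combining these results for all $i=1,\ldots,N$,
$$E_{\xi_2|\xi_1}(\mathbf{Y}) = \mathbf{m}\otimes\mathbf{1}_M$$
where $\mathbf{m} = \left(\left(m_i\right)\right)$ is an $N\times 1$ vector and
$$\operatorname{var}_{\xi_2|\xi_1}(\mathbf{Y}) = \begin{pmatrix} s_1^2 & 0 & \cdots & 0 \\ 0 & s_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_N^2 \end{pmatrix}\otimes\left(\mathbf{I}_M - \frac{1}{M}\mathbf{J}_M\right).$$
As a result,
$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) = E_{\xi_1}\left[\begin{pmatrix} s_1^2 & 0 & \cdots & 0 \\ 0 & s_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_N^2 \end{pmatrix}\otimes\left(\mathbf{I}_M - \frac{1}{M}\mathbf{J}_M\right)\right] + \operatorname{var}_{\xi_1}\left[\mathbf{m}\otimes\mathbf{1}_M\right].$$
Now, since for all $i=1,\ldots,N$, $E_{\xi_1}\left(S_i^2\right) = \frac{1}{N}\sum_{s=1}^{N}\sigma_s^2 = \sigma_e^2$ and $\operatorname{var}_{\xi_1}\left[\mathbf{m}\otimes\mathbf{1}_M\right] = \operatorname{var}_{\xi_1}\left[\mathbf{m}\right]\otimes\mathbf{J}_M$ where $\operatorname{var}_{\xi_1}\left[\mathbf{m}\right] = \sigma^2\left(\mathbf{I}_N - \frac{1}{N}\mathbf{J}_N\right)$,
$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) = \sigma_e^2\,\mathbf{I}_N\otimes\left(\mathbf{I}_M - \frac{1}{M}\mathbf{J}_M\right) + \sigma^2\left(\mathbf{I}_N - \frac{1}{N}\mathbf{J}_N\right)\otimes\mathbf{J}_M.$$
It is valuable to express this variance in a slightly different manner by grouping terms. Thus,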
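$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) = \mathbf{I}_N\otimes\left[\sigma_e^2\mathbf{I}_M + \left(\sigma^2 - \frac{\sigma_e^2}{M}\right)\mathbf{J}_M\right] - \frac{\sigma^2}{N}\,\mathbf{J}_N\otimes\mathbf{J}_M$$
or equivalently,
$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) = \mathbf{I}_N\otimes\left[\sigma_e^2\mathbf{I}_M + \left(\left(\frac{N-1}{N}\right)\sigma^2 - \frac{\sigma_e^2}{M}\right)\mathbf{J}_M\right] + \frac{\sigma^2}{N}\left(\mathbf{I}_N - \mathbf{J}_N\right)\otimes\mathbf{J}_M.$$
Re-parameterization of Variance Components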
Some authors represent the expression for the variance in terms of different parameters such that
$$\sigma_e^2\mathbf{I}_M + \left(\left(\frac{N-1}{N}\right)\sigma^2 - \frac{\sigma_e^2}{M}\right)\mathbf{J}_M = \tau^2\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}.$$
For these two expressions to be equal,
$$\rho = \frac{\left(\frac{N-1}{N}\right)\sigma^2 - \frac{\sigma_e^2}{M}}{\left(\frac{N-1}{N}\right)\sigma^2 + \left(\frac{M-1}{M}\right)\sigma_e^2}$$
and $\tau^2 = \left(\frac{N-1}{N}\right)\sigma^2 + \left(\frac{M-1}{M}\right)\sigma_e^2$. Note that the variance between cluster means is defined as $v^2 = \left(\frac{N-1}{N}\right)\sigma^2$, while the average variance within clusters we denote as
$$v_e^2 = \frac{1}{N}\sum_{s=1}^{N}v_s^2 = \left(\frac{M-1}{M}\right)\frac{1}{N}\sum_{s=1}^{N}\sigma_s^2.$$
As a result, $\rho = \dfrac{v^2 - \frac{1}{M}\sigma_e^2}{v^2 + v_e^2}$ and $\tau^2 = v^2 + v_e^2$. Solving the expression for $\rho$ for $\sigma^2$,
$$\sigma^2 = \left(\frac{N}{N-1}\right)\left[\rho\left(\frac{M-1}{M}\right) + \frac{1}{M}\right]\frac{\sigma_e^2}{1-\rho}.$$
Since $\sigma^2 \ge 0$, this implies that $\rho \ge -\frac{1}{M-1}$. The parameter $\rho$ is the intra-class correlation. Due to the finite population size, and sampling without replacement, values of $\rho$ may be negative. The lower limit for $\rho$ approaches zero when the number of SSUs is very large (see Valliant, Dorfman, et al. (2000)).
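The two stage variance and its re-parameterization can be verified exactly on a tiny population. The sketch below, in Python with NumPy, assumes a hypothetical population of N = 3 clusters with M = 3 subjects each (the values in `mu_st` are invented for illustration). It enumerates every equally likely two stage permutation, compares the resulting covariance matrix with the Kronecker product expression and its grouped form, and then checks that each diagonal block has the tau-squared and rho form.

```python
from itertools import permutations, product

import numpy as np

# Hypothetical population (illustrative values only): rows are clusters, columns are subjects.
mu_st = np.array([[1.0, 4.0, 7.0],
                  [2.0, 3.0, 10.0],
                  [6.0, 8.0, 13.0]])
N, M = mu_st.shape

# Enumerate every permutation of clusters and, within each cluster, of subjects.
rows = []
for cluster_perm in permutations(range(N)):
    for subject_perms in product(permutations(range(M)), repeat=N):
        rows.append(np.concatenate(
            [mu_st[s, list(subject_perms[s])] for s in cluster_perm]))
Y = np.array(rows)                                   # one row per realization
V_enumerated = np.cov(Y, rowvar=False, bias=True)    # exact covariance

# Variance components: divisors N - 1 and M - 1.
sigma2 = mu_st.mean(axis=1).var(ddof=1)
sigma2_e = mu_st.var(axis=1, ddof=1).mean()

I_N, J_N = np.eye(N), np.ones((N, N))
I_M, J_M = np.eye(M), np.ones((M, M))
V_formula = (sigma2_e * np.kron(I_N, I_M - J_M / M)
             + sigma2 * np.kron(I_N - J_N / N, J_M))
V_grouped = (np.kron(I_N, sigma2_e * I_M + (sigma2 - sigma2_e / M) * J_M)
             - (sigma2 / N) * np.kron(J_N, J_M))

# Re-parameterization of the diagonal block.
tau2 = (N - 1) / N * sigma2 + (M - 1) / M * sigma2_e
rho = ((N - 1) / N * sigma2 - sigma2_e / M) / tau2
block = tau2 * ((1 - rho) * I_M + rho * J_M)

# Recover sigma^2 from rho and sigma_e^2.
sigma2_back = N / (N - 1) * (rho * (M - 1) / M + 1 / M) * sigma2_e / (1 - rho)

print("formula matches enumeration:", np.allclose(V_enumerated, V_formula))
print("grouped form matches:", np.allclose(V_formula, V_grouped))
print("diagonal block matches:", np.allclose(V_formula[:M, :M], block))
print("rho:", rho, "lower bound:", -1 / (M - 1))
```

The enumeration has N! times (M!) to the power N realizations, so this check is only feasible for very small N and M.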
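Approximations for the Variance with Large Numbers of PSUs and SSUs
We approximate the variance given by
$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) = \sigma_e^2\,\mathbf{I}_N\otimes\left(\mathbf{I}_M - \frac{1}{M}\mathbf{J}_M\right) + \sigma^2\left(\mathbf{I}_N - \frac{1}{N}\mathbf{J}_N\right)\otimes\mathbf{J}_M$$
by dropping terms that depend on $\frac{1}{N}$ and $\frac{1}{M}$ when $N$ and $M$ are large. The simplification results in the expression
$$\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}) \cong \mathbf{I}_N\otimes\left(\sigma_e^2\mathbf{I}_M + \sigma^2\mathbf{J}_M\right).$$
This matrix is called compound symmetric. The covariance of different PSU vectors in the population is zero. The variances of the PSU vectors are identical, and equal to $\operatorname{var}_{\xi_1\xi_2}(\mathbf{Y}_i) \cong \sigma_e^2\mathbf{I}_M + \sigma^2\mathbf{J}_M$.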
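The quality of the large-population approximation can be examined numerically. The following minimal sketch in Python with NumPy assumes hypothetical variance components (`sigma2 = 4.0` and `sigma2_e = 9.0` are invented values, and the function names are illustrative). It builds the exact and approximate variance matrices for increasing N and M and reports the largest elementwise difference between them.

```python
import numpy as np

def exact_variance(N, M, sigma2, sigma2_e):
    """Exact two stage variance built from Kronecker products."""
    I_N, J_N = np.eye(N), np.ones((N, N))
    I_M, J_M = np.eye(M), np.ones((M, M))
    return (sigma2_e * np.kron(I_N, I_M - J_M / M)
            + sigma2 * np.kron(I_N - J_N / N, J_M))

def approximate_variance(N, M, sigma2, sigma2_e):
    """Approximation that drops the terms depending on 1/N and 1/M."""
    return np.kron(np.eye(N), sigma2_e * np.eye(M) + sigma2 * np.ones((M, M)))

sigma2, sigma2_e = 4.0, 9.0   # hypothetical variance components
sizes = [(5, 4), (10, 10), (30, 30)]

gaps = {}
for N, M in sizes:
    exact = exact_variance(N, M, sigma2, sigma2_e)
    approx = approximate_variance(N, M, sigma2, sigma2_e)
    gaps[(N, M)] = np.abs(exact - approx).max()
    print(f"N={N}, M={M}: largest elementwise difference = {gaps[(N, M)]:.4f}")
```

The largest difference occurs within a PSU block and equals the sum of the two dropped terms, `sigma2_e / M + sigma2 / N`, so it shrinks as both N and M grow.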
References
Valliant, R., A. H. Dorfman, et al. (2000). Finite Population Sampling and Inference. New York, John Wiley and Sons.