A Model for a Nested Study Ed Stanek

(1)

A Model for a Nested Study Ed Stanek

Introduction

We present notation that can be used to express a simple model for a nested study. The model is based on the assumption that response has been reported for subjects selected via a two stage cluster sample. At the first stage, a simple random sample of clusters (which we will also refer to as primary sampling units (PSU)) is selected. At the second stage, a simple random sample of subjects (which we will also refer to as second stage sampling units (SSU)) in the selected clusters will be selected. We index the selection of the clusters in the sample by

1,...,

i= n, and the selection of the subjects in a selected cluster as j=1,...,m. We assume that there is a large and equal number of SSUs in each PSU, and a large number of PSUs in the population. We define the population, write a model for a random response, and then develop estimates of parameters in the model that minimize the sum of squared error.

Study Population and Parameters

We assume the study population consists of a large number of clusters that are numbered 1,...,

s= N, where each cluster contains a large number of subjects t =1,...,M. As an example, the clusters may be hospitals, with the subjects representing patients in a hospital. We represent the potentially observable response of a subject in a cluster as a fixed constant µst. Using this representation of response, we define parameters for the mean and variance of cluster sas

1 1 M s st t M µ µ = =

∑

and

(

)

2 2 1 2 1 1 1 M st s t s s M M v M M M µ µ σ = ⎛ ₋ ⎞ ⎜ ⎟ − − ⎛ ⎞ ₌⎛ _⎞⎜ _⎟₌ ⎜ ⎟ ⎜ ⎟ ₋ ⎜ ⎟ ⎝ ⎠ ⎝ ⎠ ⎜ ⎟ ⎝ ⎠

∑

fors=1,...,N.

We describe parameters for the population of clusters next. As for the individual clusters, we define the mean and variance as

1 1 N s s N µ µ = =

∑

and

(

)

2 2 1 2 1 1 1 N s s N N v N N N µ µ σ = ⎛ ₋ ⎞ ⎜ ⎟ − − ⎛ ⎞ ₌⎛ _⎞⎜ _⎟₌ ⎜ ⎟ ⎜ ⎟ ₋ ⎜ ⎟ ⎝ ⎠ ⎝ ⎠ ⎜ ⎟ ⎝ ⎠

∑

.

We use these definitions of parameters to define a model for a response in the population. The potentially observable response for a subject is given by

(

) (

)

st s st s s st µ µ µ µ µ µ µ β ε = + − + − = + +

where β_s =

(

µ_s−µ

)

represent the deviation of a cluster mean from the overall mean, and

(

)

(2)

The Study

The study is conducted by randomly selecting i=1,...,nclusters, and within each selected cluster, randomly selecting j=1,...,m subjects and then observing response. The index

irepresents the position of a cluster in the sample, while the index jrepresents the position of a subject in the selected cluster. Since the subject assigned to position jin the ith selected cluster is random, we represent response as the random variable

( ) 1 1 N M s ij is jt st s t Y U U µ = = =

∑ ∑

representing response for the jthselected subject in the ith selected cluster. The response can also be represented via the model:

ij i ij

Y = + +µ B E

where B_i is a random variable that represents the difference between the ith selected cluster from the population mean, and E_ijis a random variable that represents the difference between the

th

j selected subject in the ith selected cluster, and the mean of theith selected cluster.

Vector and Matrix Representation of the Population and Model

The study consists of selecting a simple random sample of clusters and a sample of SSUs within the selected clusters. Although only a subset of the population will be selected in the study, the entire population can be represented as a vector of random variables. We use this representation, and subsequently describe the subset that will correspond to the realized sample.

Simple random sampling without replacement can be represented using a vector of random variables. The random variables are defined such that they can take on the values corresponding to any permutations of the values in the population. Each potential realization of the random variables is equally likely. With this representation, a sample of size nis the first

nrandom variables in the vector.

This representation can be extended to two stage sampling. With two stage sampling, the SSUs in a PSU are represented by a vector of random variables, in addition to the PSUs in the population. We define the population and model for the population using these ideas.

First, we represent the clusters in a random permutation of clusters via the positions indexed by i=1,...,N. Next, we represent a random permutation of SSUs in theith cluster by the vector Y_i =

(

Y_i₁ Y_i₂ " Y_iM

)

′, and the response vector for the population as

(

1 2

)

1 N NM× ′ ′ ′ ′ =

(3)

(

1 2

)

1 i i i i i i iM M B E B E B E × ′ = + + + E " and

(

1 2

)

1 N NM× ′ ′ ′ ′ =

E E E " E . Using this notation, we can summarize the model for the population as

1

NMY× =Xα+E where

1 NM

NMX× =1 and α µ= .

We evaluate the expected value and the variance of the population vector next. Let us represent expectation with respect to selection of PSUs with the subscript ξ₁, and expectation with respect to selection of SSUs with the subscript ξ2. Since Eξ₁

( )

Bi =0, and Eξ ξ2|1

( )

Eij =0,

then

( )

1 2

E_{ξ ξ} Y =Xµ. We make use of expressions the conditional expansion of the variance,

( )

1 2 1 2|1 1 2|1

var_{ξ ξ} Y =E_ξ ⎡_⎣var_{ξ ξ} Y _⎦⎤+var_ξ _⎣⎡E_{ξ ξ} Y ⎤_⎦ to evaluate the variance. In so doing, we take advantage of results developed for single stage sampling given in c01ed11.doc.

In a single stage problem, the variance was derived for a vector of random variables that corresponds to a random permutation of the values in the population. With Npopulation values,

( )

1 1 var N N N σ _N 2 × ⎛ ⎞ = _⎜ − _⎟ ⎝ ⎠

Y I J . The variance of a vector of the first nelements in this permutation

vector is given by the upper n n× matrix given by 1 1 1 var n n n N σ2 × ⎛ ⎞ ⎛ ⎞ = − ⎜ ⎟ ⎜ ⎟ ⎝ ⎠Y ⎝I J ⎠. We use these

results to develop expressions for the variance in the context of two stage sampling. The Population Variance in Two Stage Sampling

We first evaluate expressions for the mean and variance of permutations of second stage units (SSUs) given a specified PSU. Let

1 i M×

Y represent a random permutation of SSUs for a given PSU corresponding to the ith PSU in a random permutation of PSUs. Then

( )

2|1 i i M E_{ξ ξ} Y =m1 and

( )

2 1 2 | 1 var s_i _M _M M ξ ξ = ⎛⎜ − ⎞⎟ ⎝ ⎠

Y I J . Note that in these expressions, m_i is the realization of the

random variable 1 N i is s s M U µ = =

∑

and 2 i

s is the realization of the random variable 2 2 1 N i is s s S U σ = =

∑

.

Combining these results for all i=1,...,N ,

( )

2|1 _NM ₁ M E_{ξ ξ} × = ⊗ Y m 1 where

(

( )

)

1 i Nm× = m and

( )

2 1 2 1 2 2 | 1 2 0 0 0 0 1 var 0 0 M M NM N s s M s ξ ξ _× ⎛ ⎞ ⎜ ⎟ ⎛ ⎞ ⎜ ⎟ =_⎜ ⊗_⎜ − _⎟ ⎟ ⎝ ⎠ ⎜ ⎟ ⎜ ⎟ ⎝ ⎠ Y I J " " # # % # "

(4)

( )

[

]

1 2 1 1 2 1 2 2 2 0 0 0 0 1 var var 0 0 M M M N s s E M s ξ ξ ξ ξ ⎡⎛ ⎞ ⎤ ⎢⎜ ⎟ ⎥ ⎛ ⎞ ⎢⎜ ⎟ ⎥ = _⎢_⎜ ⊗_⎜ − _⎟_⎥+ ⊗ ⎟ ⎝ ⎠ ⎢⎜_⎜ ⎟_⎟ ⎥ ⎢⎝ ⎠ ⎥ ⎣ ⎦ Y I J m 1 " " # # % # "

. Now, since for all

1,..., i= N,

( )

1 2 2 2 1 1 N i s e s E S N ξ σ σ = =

∑

= and

[

]

[ ]

1 1

varξ m⊗1M =varξ m ⊗JM where

[ ]

1 1 var _N _N N ξ =σ2⎛⎜ − ⎞⎟ ⎝ ⎠ m I J ,

( )

1 2 2 1 1 var e N M M N N M M N ξ ξ =σ ⊗⎜⎛ − ⎞⎟+σ2⎛⎜ − ⎞⎟⊗ ⎝ ⎠ ⎝ ⎠ Y I I J I J J . It is

valuable to express this variance in a slightly different manner by grouping terms. Thus,

( )

1 2 2 2 var e N e M M N M M N ξ ξ σ σ σ σ2 2 ⎡ ⎛ ⎞ ⎤ = ⊗_⎢ +_⎜ − _⎟ _⎥− ⊗ ⎝ ⎠ ⎣ ⎦ Y I I J J J or equivalently,

( )

(

)

1 2 2 2 1 var e N e M M N N M N N M N ξ ξ σ σ σ σ2 2 ⎡ ⎛⎛ − ⎞ ⎞ ⎤ = ⊗_⎢ +_⎜_⎜ _⎟ − _⎟ _⎥+ − ⊗ ⎝ ⎠ ⎝ ⎠ ⎣ ⎦ Y I I J I J J .

Re-parameterization of Variance Components

Some authors represent the expression for the variance in terms of different parameters

such that 2 2 1 1 _e e M M N N M ρ ρ ρ ρ σ σ σ τ ρ ρ 2 2 ⎛ ⎞ ⎜ ₁ ⎟ ⎛⎛ − ⎞ ⎞ _⎜ _⎟ +_⎜_⎜ _⎟ − _⎟ = ⎜ ⎟ ⎝ ⎠ ⎝ ⎠ _⎜ _⎟ ⎜ ₁⎟ ⎝ ⎠ I J " " # # % # "

. For these two expressions to be

equal, 2 2 1 1 1 1 e e N N M N M N M σ σ ρ σ σ 2 2 − ⎛ ⎞ ₋ ⎜ ⎟ ⎝ ⎠ = − − ⎛ ⎞ ₊⎛ ⎞ ⎜ ⎟ ⎜ ⎟ ⎝ ⎠ ⎝ ⎠ , and 2 N 1 2 M 1 _e2 N M τ =⎛_⎜ − ⎞_⎟σ +⎛_⎜ − ⎞_⎟σ

⎝ ⎠ ⎝ ⎠ . Note that the variance

between cluster means is defined as v2 N 1 2

N σ

−

⎛ ⎞

= ⎜ ⎟

⎝ ⎠ , while the average variance within clusters

we denote as 2 2 2 1 1 1 N 1 1 N e s s s s M v v N = M σ N = − ⎛ ⎞ = _⎜ _⎟ = ⎝ ⎠

∑

. As a result, 2 2 2 2 1 e e v M v v σ ρ = − + , and 2 2 2 e v v τ = + . Solving this expression for σ2, 1 1 1 2

1 1 e N M N M M σ ρ σ ρ 2 ₌ ⎛ ⎞⎡ ⎛ − ⎞₊ ⎤ ⎜ ⎟⎢ ⎜ ⎟ ⎥ − ⎝ − ⎠⎣ ⎝ ⎠ ⎦ . Sinceσ 0 2_≥ , this implies that 1 1 M ρ≥= −

− . The parameter ρ is the intra-class correlation. Due to the finite

population size, and sampling without replacement, values of ρ may be negative. The lower limit for ρ approaches zero when the number of SSUs is very large (see (Valliant, Dorfman et

(5)

Approximations for the Variance with Large Numbers of PSUs and SSUs We approximate the variance given by

( )

1 2 2 1 1 var e N M M N N M M N ξ ξ =σ ⊗⎜⎛ − ⎞⎟+σ2⎛⎜ − ⎞⎟⊗ ⎝ ⎠ ⎝ ⎠ Y I I J I J J

by dropping terms that depend on 1

N and

1

M when N and M are large. The simplification

results in the expression 1 2

( )

(

)

~

2

varξ ξ Y =I_N ⊗ σ_eI_M +σ2J_M . This matrix is called compound

symmetric. The covariance of different PSU vectors in the population is zero. The variance of the PSU vectors are identical, and equal to 1 2

( )

~ 2 varξ ξ j σe M σ M 2 = + Y I J .

(6)

References

Valliant, R., H. A. Dorfman, et al. (2000). Finite population sampling and inference. New York, John Wiley and Sons