Consider once again the CS design described before, namely a stratified two-stage cluster sample design in which the population is divided into H strata and the units in each stratum are grouped into PSU's. A sample of PSU's is selected from each stratum, and the units in each selected PSU are then grouped into SSU's. A sample of SSU's is taken from each of the sampled PSU's, and these form the USU's of the sample. Recall that each USU is finally assigned a sampling weight, defined as the inverse of the inclusion probability of that unit. Refer to section 2.6 for a discussion of how the sampling weights are developed. This sampling design is illustrated in the figure below.
CHAPTER 5. SIMULATION MODEL 114
Figure 5.2.1: General Stratified Two-stage Cluster Sample Design
Now relate this CS design to the hierarchical structure described in the previous section by, firstly, considering each stratum as an independent population. This is a reasonable assumption to make since strata are defined as non-overlapping subgroups of the population as a whole. Consider stratum h, for which the units have been grouped into $N_h$ PSU's and from which a sample of $n_h$ PSU's has been selected. The PSU level in a CS design corresponds to the level two units in the multilevel model. A further sampling level is defined by grouping the units in each PSU into SSU's. For the jth sampled PSU in stratum h, which contains $N_{hj}$ SSU's, $n_{hj}$ SSU's are selected. The SSU level in this design corresponds to the level one units defined in the multilevel model. Since no further sampling takes place beyond the SSU level, these units are considered the ultimate sampling units, or USU's.
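As a minimal sketch of this design for a single stratum, the Python fragment below selects PSU's and then SSU's and attaches a weight equal to the inverse inclusion probability. It assumes simple random sampling without replacement at both stages, which is an assumption made only for illustration; the weights developed in section 2.6 may differ. The function name `two_stage_sample` and the stratum sizes are hypothetical.

```python
import random

def two_stage_sample(ssu_counts, n_h, n_hj, seed=0):
    """Two-stage sample from one stratum (SRS at both stages is assumed).

    ssu_counts[j] is N_hj, the number of SSU's in PSU j of the stratum.
    Returns a list of (psu, ssu, weight) tuples.
    """
    rng = random.Random(seed)
    N_h = len(ssu_counts)
    sample = []
    # Stage 1: select n_h of the N_h PSU's without replacement.
    for j in rng.sample(range(N_h), n_h):
        N_hj = ssu_counts[j]
        take = min(n_hj, N_hj)
        # Inclusion probability of an SSU under SRS at both stages.
        pi = (n_h / N_h) * (take / N_hj)
        # Stage 2: select SSU's within the sampled PSU without replacement.
        for i in rng.sample(range(N_hj), take):
            sample.append((j, i, 1.0 / pi))
    return sample

# Hypothetical stratum with six PSU's of differing sizes.
stratum = [40, 55, 30, 80, 65, 50]
sample = two_stage_sample(stratum, n_h=3, n_hj=10)
```

Under these assumptions every SSU in the same sampled PSU receives the same weight, $(N_h / n_h)(N_{hj} / n_{hj})$.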
The simulation of this type of data is quite uncommon, and thus only a limited number of examples were available to study. The few that were available tended to define sampling schemes in such a way that they delivered informative samples. Kim and Skinner (2013) define informative sampling simply as a sampling scheme that is related to the response variable of a regression analysis, conditional on the independent variables. For a discussion of informative and non-informative sampling, the reader is referred to Kim and Skinner (2013).
Now consider two examples of using multilevel models to simulate CS data:
• Pfeffermann et al. (1998) defined the model
\[ y_{ij} = \beta + u_j + \nu_{ij}, \quad j = 1, \ldots, N, \quad i = 1, \ldots, N_j, \]
where N is the number of level 2 units (PSU's) and $N_j$ is the number of level 1 units (SSU's) in the jth level 2 unit, $u_j \sim N(0, \omega^2)$ is the level 2 random effect, and $\nu_{ij} \sim N(0, \sigma^2)$ is the level 1 random effect. The authors chose $\beta = 1$, $\omega^2 = 0.2$, $\sigma^2 = 0.5$ and $N = 300$. The sizes of the level 2 units were calculated as
\[ N_j = 75 \exp(\tilde{u}_j), \]
where $\tilde{u}_j$ was generated from $N(0, \omega^2)$ and then limited to lie within the interval $-1.5\omega \le \tilde{u}_j \le 1.5\omega$. Thus, the sizes of the level 2 units range from 38 to 147 for the parameter values specified (Pfeffermann et al., 1998). Next, the authors defined 3 different sampling schemes for sampling from the simulated population:
1. Sample n level 2 units with probability proportional to size (PPS), where the measure of size (MOS), $X_j$, is simulated using the level 2 random effect, $X_j = 75 \exp(u_j)$. From this it follows that the selection probability of the jth level 2 unit is
\[ \pi_j = \frac{n X_j}{\sum_{j=1}^{N} X_j}. \]
Next, the level 1 units in each sampled level 2 unit were partitioned into 2 strata according to their associated random effects. Specifically, level 1 units with $\nu_{ij} > 0$ were assigned to the first stratum and the other level 1 units to the second stratum. SRS was used to sample level 1 units from each stratum, $0.25 n_j$ from stratum 1 and $0.75 n_j$ from stratum 2, where $n_j$ was either a fixed quantity or proportional to $N_j$, the number of population level 1 units in the jth sampled level 2 unit (Pfeffermann et al., 1998). Suppose $n_j$ is chosen as a fixed quantity. It follows that the selection probability of the ith level 1 unit, given that the jth level 2 unit was selected, is $\pi_{i|j} = n_j / N_j$. This sampling scheme is used to ensure informative sampling at both levels, i.e. the inclusion probability of the ultimate sampling unit, conditional on the covariates, is related to the outcome of interest (Kim et al., 2013). The inclusion probability of a level 1 unit in this case is given by
\[ \pi_{ij} = \frac{n X_j}{\sum_{j=1}^{N} X_j} \cdot \frac{n_j}{N_j}. \]
2. The second sampling scheme defined by Pfeffermann et al. (1998) ensures that the sampling is informative only at level 2. It is the same as the sampling scheme defined in (1), but with SRS employed to sample the level 1 units within each sampled level 2 unit.
3. The final sampling scheme leads to a non-informative sample, i.e. completely random at both levels. This is similar to the sampling scheme described in (2), but here the MOS, $X_j$, is set equal to the population size of the jth level 2 unit, i.e. $X_j = N_j$ (Pfeffermann et al., 1998). Now the inclusion probability of a level 1 unit is given by
\[ \pi_{ij} = \frac{n N_j}{\sum_{j=1}^{N} N_j} \cdot \frac{n_j}{N_j}, \]
which simplifies to
\[ \pi_{ij} = \frac{n \, n_j}{\sum_{j} N_j}. \]
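The population step and the level 2 inclusion probabilities described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: limiting $\tilde{u}_j$ to the interval is implemented by clipping (re-drawing out-of-range values would be an equally plausible reading), $N_j$ is rounded to the nearest integer, and the sample sizes n = 30 and $n_j$ = 10 are hypothetical. The function names are also hypothetical, and the PPS draw itself is not shown.

```python
import math
import random

def simulate_population(N=300, beta=1.0, omega2=0.2, sigma2=0.5, seed=1):
    """Simulate y_ij = beta + u_j + nu_ij for N level 2 units."""
    rng = random.Random(seed)
    omega, sigma = math.sqrt(omega2), math.sqrt(sigma2)
    population = []
    for _ in range(N):
        u_j = rng.gauss(0.0, omega)  # level 2 random effect
        # Size variable: drawn from N(0, omega^2), then clipped (assumption).
        u_tilde = max(-1.5 * omega, min(1.5 * omega, rng.gauss(0.0, omega)))
        N_j = round(75 * math.exp(u_tilde))  # rounding is an assumption
        y = [beta + u_j + rng.gauss(0.0, sigma) for _ in range(N_j)]
        population.append({"u": u_j, "N_j": N_j, "y": y})
    return population

def pps_inclusion_probs(mos, n):
    """Level 2 inclusion probabilities pi_j = n * X_j / sum(X)."""
    total = sum(mos)
    return [n * x / total for x in mos]

pop = simulate_population()
n, n_j = 30, 10  # hypothetical sample sizes

# Schemes 1 and 2: MOS built from the level 2 random effect (informative).
mos_informative = [75 * math.exp(g["u"]) for g in pop]
pi_informative = pps_inclusion_probs(mos_informative, n)

# Scheme 3: MOS equal to the level 2 unit size (non-informative).
sizes = [g["N_j"] for g in pop]
pi_j = pps_inclusion_probs(sizes, n)
pi_ij = [p * n_j / g["N_j"] for p, g in zip(pi_j, pop)]
```

In scheme 3 every entry of `pi_ij` equals `n * n_j / sum(sizes)`, which is the simplification stated above: all level 1 units share the same inclusion probability.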
• Asparouhov et al. (2005) defined the model
\[ y_{ij} = \mu_j + \lambda_j \eta_i + \varepsilon_{ij}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, 5, \]
to simulate a population from which a CS, similar to the one described in figure 5.2.1, can be selected. The PSU's are selected without replacement with equal probabilities, and the SSU's are selected with or without replacement, also with equal probabilities. The data are to be used for factor analysis. In the model, $\mu_j$ is the intercept parameter, $\lambda_j$ is the loading parameter, $\eta_i \sim N(0, \psi)$ is the factor variable, and $\varepsilon_{ij} \sim N(0, \theta_j)$ is the residual variable (Asparouhov et al., 2005). Hence, the parameters used for this model are
\[ \Theta = (\mu_1, \ldots, \mu_5, \lambda_1, \ldots, \lambda_5, \theta_1, \ldots, \theta_5, \psi). \]
The authors proceeded by generating a population of size 50000 with 5 outcomes, each distributed normally with mean and variance given by the model, using the parameter values specified in $\Theta$. After doing so, the authors grouped the simulated population observations in such a way as to resemble a two-level structure. The observations were, firstly, grouped into 140 PSU's by ordering the observations according to some function, f, which the authors chose as
\[ f_i = \sum_{j} y_{ij}, \]
to ensure informative sampling. The observations were then ranked according to their respective f-scores and then assigned to the PSU's. Of the 140 PSU's, the first 120 received 250 observations each and the remaining 20 received 1000 each. Finally, the two-stage sample was selected as described above (Asparouhov et al., 2005).
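The grouping step in this second example can be sketched in Python as follows. The sketch rests on two assumptions that the text does not settle: the parameter values in $\Theta$ are not given here, so the values below are placeholders, and the ranked observations are assumed to be assigned to PSU's in consecutive blocks. The function names are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_factor_population(size, mu, lam, theta, psi, seed=2):
    """Simulate y_ij = mu_j + lam_j * eta_i + eps_ij for `size` observations."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        eta = rng.gauss(0.0, math.sqrt(psi))  # factor variable
        rows.append([m + l * eta + rng.gauss(0.0, math.sqrt(t))
                     for m, l, t in zip(mu, lam, theta)])
    return rows

def assign_psus(rows, psu_sizes):
    """Rank observations by f_i = sum_j y_ij and fill the PSU's in rank order."""
    order = sorted(range(len(rows)), key=lambda i: sum(rows[i]))
    psu_of = [None] * len(rows)
    start = 0
    for psu, size in enumerate(psu_sizes):
        # Consecutive blocks of the ranking form one PSU (assumption).
        for i in order[start:start + size]:
            psu_of[i] = psu
        start += size
    return psu_of

# Placeholder parameter values; the source does not list those used.
mu, lam, theta, psi = [0.0] * 5, [1.0] * 5, [1.0] * 5, 1.0
rows = simulate_factor_population(50000, mu, lam, theta, psi)

# 120 PSU's of 250 observations and 20 PSU's of 1000 observations.
psu_sizes = [250] * 120 + [1000] * 20
psu_of = assign_psus(rows, psu_sizes)
counts = Counter(psu_of)
```

Because observations with similar f-scores end up in the same PSU, PSU membership is related to the outcomes, which is what makes an equal-probability selection of PSU's informative in this setup.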
Note that in further discussion the level 2 units will be called PSU's and the level 1 units SSU's, to remain in line with the terminology used in figure 5.2.1.