Multiple Level Sampling

Previously, we discussed single-level surveys where either the variables of interest or these variables plus covariates on the same sample units were measured. When covariates were measured, we assumed that the covariate values for all units in the population or, at least, the population total were known. Often the covariates are useful for estimation and are cheaper to collect but unknown. It then often pays to collect covariate information on a large sample and the variable of interest on a subsample.

This approach is referred to as multilevel sampling. For example, in a timber sale we might obtain ocular estimates of diameter at breast height D (and hence D²) for a large sample of trees and measure actual volume on some of these trees. Or, in estimating recreational use of an area, we may use traffic counters recording counts of vehicles at the entrance to the park on a large number of days and actually count the number of users on a subset of these days.

Multilevel sampling can be separated into multiphase and multistage sampling.

Multistage Sampling

Multistage sampling refers to sampling designs where the ultimate sample units, called elements, are selected in stages. Samples at each stage are taken from the sample units comprising clusters of units selected in the previous stage. Interest is in estimation of attribute totals or means per element, such as biomass per tree rather than per ha. The population is first divided into a number of primary sample units (PSUs), some of which are selected as the first stage sample. These selected PSUs are then subdivided into a series of secondary sample units (SSUs), some of which are randomly selected as the second stage sample. This process can of course be repeated with additional stages if necessary.

The procedure has the advantage of concentrating work on a relatively small number of PSUs after which much less effort is usually needed to obtain the second and later samples.

The main reasons for selecting a multistage sample are:

1. Drawing a set of units from a population such as trees in a large forest or recreation users of a park over a full season is expensive. It is difficult to obtain a list of all the trees and even more difficult to determine all users of a park.

2. Even if a list of population units were available, efficiency might dictate that groups of units (clusters) rather than single units be chosen and that only some units in each cluster be measured. For example, it is usually cheaper to sample 20 randomly located clusters of 30 trees in a forest than 600 randomly located trees, and we may want to sample only 10 of the 30 trees in each cluster because of the homogeneity within the cluster or the expense of measuring all 30 trees. In sampling recreation users, it is clearly easier to select and subsample random days on which to interview all users rather than attempt to randomly sample individual users or days, respectively.

Generally, though, as indicated earlier, there is a definite tradeoff in efficiency between cluster sampling and random sampling of units: units close together are often more similar than those further apart, so it often pays to measure only some of them in each selected cluster.

Sampling can be in a large number of stages. We illustrate how this works with the simple and often practically useful situation of 2-stage sampling with SRS at each stage. Assume N groups or clusters with $M_i$ units (i = 1, …, N) in the i-th cluster. Our total of interest can now be written as:

$$Y = \sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij} = \sum_{i=1}^{N} Y_i \qquad (67)$$

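In 2-stage sampling a random sample of n is selected out of the N clusters but, instead of measuring all units in the cluster, a random sample of $m_i$ units is chosen in each. Thus, the cluster total $Y_i$ is first estimated by

$$\hat{Y}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij} = M_i\,\bar{y}_i$$

for each of the n clusters sampled. Our estimated total is:

$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n}\hat{Y}_i$$

with variance

$$V(\hat{Y}) = N^2\left(1-\frac{n}{N}\right)\frac{\sigma_1^2}{n} + \frac{N}{n}\sum_{i=1}^{N} M_i^2\left(1-\frac{m_i}{M_i}\right)\frac{\sigma_{2i}^2}{m_i}$$

where $\sigma_1^2 = \sum_{i=1}^{N}(Y_i-\bar{Y})^2/(N-1)$, the between-cluster variance, and $\sigma_{2i}^2 = \sum_{j=1}^{M_i}(y_{ij}-\bar{Y}_i)^2/(M_i-1)$, the within-cluster variance in cluster i, with $\bar{Y} = Y/N$ the mean total per PSU and $\bar{Y}_i = Y_i/M_i$ the mean per element in PSU i. Similarly, an unbiased variance estimator $v(\hat{Y})$ is:

$$v(\hat{Y}) = N^2\left(1-\frac{n}{N}\right)\frac{s_1^2}{n} + \frac{N}{n}\sum_{i=1}^{n} M_i^2\left(1-\frac{m_i}{M_i}\right)\frac{s_{2i}^2}{m_i}$$

with $s_1^2 = \sum_{i=1}^{n}(\hat{Y}_i-\hat{Y}/N)^2/(n-1)$ and $s_{2i}^2 = \sum_{j=1}^{m_i}(y_{ij}-\bar{y}_i)^2/(m_i-1)$.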

There is a considerable literature on multistage sampling, but this subject is still best discussed in the book by Murthy (1967).
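The 2-stage estimator with SRS at each stage can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: it assumes the standard textbook forms of the estimator and its variance estimator for SRS without replacement at both stages, and the function name and the `(M_i, y_values)` input layout are hypothetical.

```python
from statistics import mean, variance


def two_stage_total(N, clusters):
    """Estimate a population total from a 2-stage SRS sample.

    N        -- number of PSUs (clusters) in the population
    clusters -- list of (M_i, y_values) pairs, one per sampled PSU, where
                M_i is the number of elements in that PSU and y_values are
                the measurements on its m_i subsampled elements (m_i >= 2)
    Returns (estimated total, estimated variance of the total).
    Assumes at least two PSUs were sampled.
    """
    n = len(clusters)

    # Estimated PSU totals: M_i times the mean of the sampled elements.
    psu_totals = [M * mean(y) for M, y in clusters]
    total = N / n * sum(psu_totals)

    # Between-PSU component, from the spread of the estimated PSU totals.
    s1_sq = variance(psu_totals)  # divisor n - 1
    between = N**2 * (1 - n / N) * s1_sq / n

    # Within-PSU component, summed over the sampled PSUs.
    within = 0.0
    for M, y in clusters:
        m = len(y)
        within += M**2 * (1 - m / M) * variance(y) / m
    within *= N / n

    return total, between + within
```

For instance, `two_stage_total(10, [(4, [1, 3]), (6, [2, 4, 6])])` treats a population of 10 PSUs from which 2 were sampled, with 2 of 4 and 3 of 6 elements measured.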

Multiphase Sampling

In multiphase sampling, sample units of the same size are retained at each level (phase), but fewer sample units are selected at each consecutive phase. In the last phase the variable of interest is measured and is combined with covariate information from the earlier phases either in design (stratification or pps sampling) or in estimation (regression or ratio estimation). In multiphase sampling a complete frame of units is required since a sample of units is selected at each phase. The main reason for using multiphase sampling is to reduce the cost of sampling by collecting a large amount of relatively cheap information on covariates that are correlated with the variables of interest and then measuring the variables of interest on a smaller sample. Stratified double sampling and double sampling for regression or ratio estimation are two examples. Specifically:

1. For stratified double sampling, the large (first phase) sample information is used to construct strata from which the second phase samples are selected. Typically this is done if interest is in specific subpopulations (strata) or the strata are more homogeneous than the overall population so that efficiency is gained by stratification. For example, in traditional large-scale timber surveys we might have a large sample of, say, n' 1-ha plots from remote sensing or photos classified into primarily large timber, pole timber, and regeneration. Clearly, if interest is in volume, those three strata would be of interest in their own right and are likely to be much more homogeneous (if the classification by remote sensing is sufficiently well done) than the overall population. A subsample of those 1-ha plots would then be sampled on the ground for volume by stratum. Similarly, in sampling a large park for recreation use, we might take a large sample of photos on sample days to count users, use that information to divide the park into strata of heavy, moderate, and low use days, and then sample these three strata on a subset of those same sample days. The estimator of the total in both cases is:

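$$\hat{Y}_{dst} = \sum_{h=1}^{K}\hat{N}_h\,\bar{y}_h = N\sum_{h=1}^{K}\frac{n'_h}{n'}\,\bar{y}_h$$

where $\hat{N}_h = N n'_h/n'$ is the estimated number of sample units in stratum h, $n'_h$ and $n'$ are the first phase sample sizes for stratum h and overall respectively, $n_h$ is the second phase sample size in stratum h, and $\bar{y}_h$ is the estimated mean for stratum h from the sample of $n_h$ units in that stratum. The variance of this estimator is:

$$V(\hat{Y}_{dst}) = N^2\left[S_y^2\left(\frac{1}{n'}-\frac{1}{N}\right) + \sum_{h=1}^{K}\frac{W_h S_{yh}^2}{n'}\left(\frac{1}{\nu_h}-1\right)\right]$$

where $W_h = N_h/N$ is the stratum weight, $\nu_h = n_h/n'_h$ is the second phase sampling fraction in stratum h, and $S_y^2$ and $S_{yh}^2$ are the variances of y in the population and in stratum h respectively.

An almost unbiased sample estimator of $V(\hat{Y}_{dst})$, if both 1/N and 1/n' are negligible, is:

$$v(\hat{Y}_{dst}) = N^2\left[\sum_{h=1}^{K}\frac{w_h^2 s_{yh}^2}{n_h} + \frac{1}{n'}\sum_{h=1}^{K} w_h\left(\bar{y}_h-\bar{y}_{st}\right)^2\right]$$

with $w_h = n'_h/n'$, $s_{yh}^2$ the sample variance of y in stratum h, and $\bar{y}_h$ and $\bar{y}_{st} = \sum_h w_h\bar{y}_h$ the sample mean for stratum h and the overall sample mean for stratified sampling respectively.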

Strata may be of different degrees of interest and vary in homogeneity, so varying sampling rates may be desirable. This requires knowledge or an estimate of the variability within the strata in order to allocate n. If such knowledge is available, one can then optimally allocate the sample to the strata. Assume that there is information available or easily collectable on a variable x correlated with y. Then applying the simple cost function:

$$C = C' n' + \sum_{h=1}^{K} C_h n_h$$

where C′ is the cost of classifying a unit for the first phase and $C_h$ is the cost of measuring a unit in stratum h, the expected cost E(C) is:

$$E(C) = C' n' + n'\sum_{h=1}^{K} C_h \nu_h W_h$$

with $\nu_h = n_h/n'_h$ the second phase sampling fraction and $W_h$ the weight of stratum h. Then the optimal n' can be computed from E(C) by substituting

$$s_{yh}\sqrt{\frac{C'}{C_h\left(s_y^2-\sum_{h=1}^{K} w_{0h}\,s_{yh}^2\right)}}$$

for $\nu_h$ and $w_{0h}$ for $W_h$, where $s_y^2$ and $s_{yh}^2$ are the estimated variances for variable y in the population and stratum h respectively and $w_{0h}$ is the estimated stratum weight for stratum h based on the preliminary information. More complex cost functions are discussed in the literature, especially Hansen and others (1953), but usually insufficient information is available to justify a better cost function, such functions make sample size computations more difficult, and sample size determination seems fairly insensitive to improved cost functions.

2. For double sampling with ratio or regression estimators, a linear relationship is assumed between the covariates and the variables of interest as shown in the general linear model in (39).

For instance, in the timber example above, one may have confidence that the information on the 1-ha remotely sensed or photo plots is linearly related to the same information as measured on the ground. Or, similarly, the photo counts of recreational users might be linearly related to the actual counts on the ground. Clearly, whether such a linear relationship exists as a useful approximation or there is a useful but unknown relationship between the remote sensing and the ground information in both cases determines whether stratified double sampling or double sampling with ratio/regression estimation is more efficient and reliable. The regression estimator of the overall total is:

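$$\hat{Y}_{dr} = \sum_{j=1}^{n'}\frac{\hat{y}_j}{\pi_{ja}} + \sum_{i=1}^{n}\frac{y_i-\hat{y}_i}{\pi_i}$$

where $\pi_{ja}$ is the probability of selecting unit j in the sample of n' units, $\pi_i$ the probability of selecting unit i in the sample of n units, and $\hat{y}$ the value predicted for a unit from its covariates by the linear model fitted to the n second phase units.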

Deriving a classical variance estimator for this estimator is difficult and this is an example where bootstrap variance estimation would be the method of choice.
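For the stratified double sampling case in item 1, the cost-based choice of subsampling fractions can be illustrated with a short Python sketch. It assumes the simple linear cost function C = C'n' + Σ C_h n_h and the textbook optimum ν_h = s_yh · sqrt(C' / (C_h (s_y² − Σ w_0h s_yh²))) for the second phase sampling fraction ν_h = n_h/n'_h. The function name, the argument names, the fixed budget, and the capping of ν_h at 1 are illustrative assumptions, not part of the original text.

```python
from math import sqrt


def optimal_double_sampling(budget, c_first, c_h, w0, s_yh, s_y):
    """Subsampling fractions and first phase size for a fixed expected cost.

    budget  -- total expected cost available, E(C)
    c_first -- cost C' of classifying one first phase unit
    c_h     -- list of costs C_h of measuring one unit in stratum h
    w0      -- list of preliminary stratum weights w_0h (summing to 1)
    s_yh    -- list of estimated within-stratum standard deviations of y
    s_y     -- estimated overall standard deviation of y
    Returns (list of nu_h, first phase sample size n').
    """
    within = sum(w * s**2 for w, s in zip(w0, s_yh))
    gain = s_y**2 - within  # variance removed by stratifying
    if gain <= 0:
        raise ValueError("strata are not more homogeneous than the population")

    # Optimal second phase fractions; capping at 1 is a simplification
    # (a full treatment would re-optimise the remaining strata).
    nu = [min(1.0, s * sqrt(c_first / (c * gain))) for s, c in zip(s_yh, c_h)]

    # Expected cost per first phase unit: C' + sum of C_h * nu_h * w_0h.
    cost_per_unit = c_first + sum(c * v * w for c, v, w in zip(c_h, nu, w0))
    return nu, budget / cost_per_unit
```

The expected second phase sample size in stratum h then follows as ν_h · w_0h · n'.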

Illustration: A large sample of n' plots is measured for plot basal areas on aerial photos. These could be stratified into K strata, selecting either a subsample of n plots in the K strata or a SRS of n out of the n' plots, which are then measured on the ground. Using $BAT_i$ and $VT_i$ to denote basal area on plot i as measured on the photo plots and volume as measured on the ground plots, we then have:

n' plots with $BAT_i$, i = 1, ..., n'

n plots with $BAT_{hi}$ and $VT_{hi}$, h = 1, ..., K, i = 1, ..., $n_h$, for stratified double sampling, or $BAT_i$ and $VT_i$, i = 1, ..., n, for double sampling with regression.

Whether stratified double sampling or double sampling with a regression estimator would be used depends on the relationship expected between $BAT_i$ and $VT_i$. If a linear relationship is expected, regression estimation would be used; otherwise double sampling for stratification is indicated.

For stratified double sampling one would use (72) and (74) to estimate total volume and its variance.
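A minimal Python sketch of this calculation follows. It assumes the estimator has the usual form N Σ w_h ȳ_h with w_h = n'_h/n', and that the variance estimator is the common approximation N² [Σ w_h² s_h²/n_h + (1/n') Σ w_h (ȳ_h − ȳ_st)²] that neglects terms in 1/N and 1/n'. The function name and input layout are hypothetical.

```python
from statistics import mean, variance


def stratified_double_sampling_total(N, first_phase_counts, ground_samples):
    """Estimate total volume from stratified double sampling.

    N                  -- number of plots in the population
    first_phase_counts -- n'_h: photo plots classified into each stratum
    ground_samples     -- per stratum, the ground volumes VT_hi measured on
                          the n_h subsampled plots (n_h >= 2)
    Returns (estimated total, approximate estimated variance).
    """
    n_first = sum(first_phase_counts)
    w = [c / n_first for c in first_phase_counts]  # w_h = n'_h / n'
    ybar_h = [mean(y) for y in ground_samples]
    ybar_st = sum(wh * yb for wh, yb in zip(w, ybar_h))
    total = N * ybar_st

    # Usual stratified term plus the extra term due to estimated weights.
    within = sum(wh**2 * variance(y) / len(y)
                 for wh, y in zip(w, ground_samples))
    between = sum(wh * (yb - ybar_st) ** 2
                  for wh, yb in zip(w, ybar_h)) / n_first
    return total, N**2 * (within + between)
```

For example, `stratified_double_sampling_total(1000, [60, 40], [[10, 12, 14], [20, 24]])` uses 100 photo plots split 60/40 between two strata, with 3 and 2 ground plots respectively.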

For double sampling with regression one would use (76) with a bootstrap variance estimator.
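The following Python sketch shows one way this could look under simplifying assumptions: SRS at both phases (so the selection probabilities are n'/N and n/N), a single covariate, and an ordinary least squares line with intercept, in which case the estimator reduces to N[ȳ + b(x̄' − x̄)]. The bootstrap shown is a naive one that resamples first phase plots with replacement and keeps their second phase status. It ignores finite population corrections and is only one of several possible bootstrap schemes. All names are hypothetical.

```python
import random
from statistics import mean, stdev


def regression_total(N, x_first, pairs):
    """Double sampling regression estimate of a total (SRS at both phases).

    x_first -- covariate (e.g. photo basal area) on all n' first phase plots
    pairs   -- (x, y) on the n second phase plots (photo value, ground value)
    """
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in pairs) / sxx
    return N * (ybar + slope * (mean(x_first) - xbar))


def bootstrap_se(N, first_phase, reps=500, seed=1):
    """Naive bootstrap standard error of the regression estimate.

    first_phase -- list of (x, y) for every first phase plot, with y set to
                   None for plots that were not measured on the ground
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        resample = rng.choices(first_phase, k=len(first_phase))
        pairs = [(x, y) for x, y in resample if y is not None]
        if len({x for x, _ in pairs}) < 2:
            continue  # slope undefined for this replicate; skip it
        x_first = [x for x, _ in resample]
        estimates.append(regression_total(N, x_first, pairs))
    return stdev(estimates)
```

Skipping replicates with fewer than two distinct covariate values is a pragmatic choice for tiny samples; with realistic sample sizes it should rarely be triggered.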

If one expects the relationships between the covariates and the variables of interest to go approximately through the origin, double sampling with a ratio of means estimator can be used:

$$\hat{Y}_{drm} = \frac{\sum_{i=1}^{n} y_i/\pi_i}{\sum_{i=1}^{n} x_i/\pi_i}\,\sum_{j=1}^{n'}\frac{x_j}{\pi_{ja}}$$

where x is the covariate. In this case too it is best to use bootstrapping to estimate the variance of $\hat{Y}_{drm}$. Here too the $\pi_i$ may not always be computable.
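As a small illustration, the ratio of means estimator can be sketched in Python for the special case of SRS at both phases, where the selection probabilities are constant and cancel within the ratio. The function name and arguments are hypothetical, and the equal-probability assumption is a simplification.

```python
def ratio_of_means_total(N, x_first, pairs):
    """Double sampling ratio of means estimate of a total (SRS, both phases).

    N       -- number of units in the population
    x_first -- covariate values on all n' first phase units
    pairs   -- (x, y) values on the n second phase units
    """
    # Ratio of second phase totals; equal weights cancel under SRS.
    ratio = sum(y for _, y in pairs) / sum(x for x, _ in pairs)
    # Expanded first phase covariate total, (N / n') * sum of x_j.
    x_total_hat = N / len(x_first) * sum(x_first)
    return ratio * x_total_hat
```

A bootstrap standard error can be obtained by resampling the first phase units and recomputing this estimate for each replicate.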

Multilevel sampling methods are common in forestry, especially for large-scale surveys. For example:

1. Double sampling for stratification is used in large-scale surveys such as FIA. Areas are usually stratified into forested vs. non-forested areas by either photography or, more commonly now, by data collected from remote sensing sources such as the Landsat Thematic Mapper Satellite (TM), and then ground plots are measured in those strata. In the past, with primary interest in timber, prestratification was used. Now post-stratification is used because plots are grid-based. Newer remote sensing sources will define small features on the ground better, and locations of both the ground and remote sensing information can be pinpointed more accurately with improved GPS units. It is likely that more detailed stratification and regression estimation will improve estimation in the future.

2. VRP sampling with subsequent selection of trees by either Poisson sampling proportional to estimated tree heights or another subsampling scheme was frequently used in timber sales.

Clearly combinations of multiphase and multistage sampling can be desirable too. For instance, in example 1 above we might select a random sample of trees on the selected ground plots. This design then would be double sampling for stratification with random subsampling.
