
Precision essentially depends only on the absolute sample size, not on the relative fraction of the population sampled.

3.6 Systematic sampling

Sometimes, logistical considerations make a true simple random sample not very convenient to administer.

For example, in the previous creel survey, a true random sample would require that a random number be generated for each boat returning to the marina. In such cases, a systematic sample could be used to select elements. For example, every 5th angler could be selected after a random starting point.

3.6.1 Advantages of systematic sampling

The main advantages of systematic sampling are:

• it is easier to draw units because only one random number is chosen

• it can be used when a sampling frame is not available but there is a convenient method of selecting items, e.g. the creel survey where every 5th angler is chosen

• easier instructions for untrained staff

• if the population is in random order relative to the variable being measured, the method is equivalent to a SRS. For example, it is unlikely that the number of anglers in each boat changes dramatically over the period of the day. This is an important assumption that should be investigated carefully in any real life situation!

• it distributes the sample more evenly over the population. Consequently if there is a trend, you will get items selected from all parts of the trend.

3.6.2 Disadvantages of systematic sampling

The primary disadvantages of systematic sampling are:

• Hidden periodicities or trends may cause biased results. In such cases, estimates of mean and standard errors may be severely biased! See Section 4.2.2 for a detailed discussion.

• Without making an assumption about the distribution of population units, there is no estimate of the standard error. This is an important disadvantage of a systematic sample! Many studies very casually make the assumption that the systematic sample is equivalent to a simple random sample without much justification for this.
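To make the two disadvantages concrete, here is a minimal Python sketch. The population is entirely hypothetical (it is not data from these notes): 100 units whose values repeat with period 10, sampled with k = 10 so that the sampling interval matches the hidden period. The helper name `systematic_sample` is likewise just an illustration.

```python
import statistics

# Hypothetical population (not from the notes): 100 units whose values
# repeat with period 10, i.e. 0, 1, ..., 9, 0, 1, ..., 9, ...
N, k = 100, 10
population = [i % 10 for i in range(N)]
true_mean = statistics.mean(population)  # 4.5

def systematic_sample(values, k, start):
    """Return every k-th value, beginning at 0-based position `start`."""
    return values[start::k]

# Print the sample mean and sample standard deviation for every possible start.
for start in range(k):
    sample = systematic_sample(population, k, start)
    print(start, statistics.mean(sample), statistics.stdev(sample))
```

Every possible sample consists of 10 identical values, so each sample mean can be far from the true mean of 4.5 while the within-sample standard deviation is zero. Treating such a sample as an SRS would therefore report a standard error of zero, which is the kind of severe bias the first bullet warns about.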

3.6.3 How to select a systematic sample

There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every k-th record, where k is chosen to meet sample size requirements (an example of choosing k will be given in class). The two most common methods are given below. They are equivalent if k divides N exactly.

• Method 1: Choose a random number j from 1, ..., k. Then choose records j, j + k, j + 2k, ...

One problem is that different samples may be of different sizes (an example will be given in class where n doesn’t divide N exactly). This causes problems in sampling theory, but it is not much of a problem if n is large.

• Method 2: Choose a random number from 1, ..., N. Starting there, choose every k-th item, continuing in a circle when you reach the end, until you have selected n items. This always gives the same sample size; however, it requires knowledge of N.
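The two methods can be sketched in Python as follows. This is only an illustration under stated assumptions: records are labelled 1, ..., N, the function names `method1` and `method2` are hypothetical, and Method 2 assumes n × k ≤ N so that no record is selected twice.

```python
import random

def method1(N, k, rng=random):
    """Method 1: random start j in 1..k, then records j, j+k, j+2k, ... up to N."""
    j = rng.randint(1, k)
    return list(range(j, N + 1, k))

def method2(N, k, n, rng=random):
    """Method 2: random start in 1..N, then every k-th record, wrapping
    around in a circle, until n records are selected.
    Assumes n * k <= N so that no record is chosen twice."""
    start = rng.randint(1, N)
    return [(start - 1 + i * k) % N + 1 for i in range(n)]

# With N = 23 and k = 5, k does not divide N exactly.
print(method1(23, 5))        # 4 or 5 records, depending on the random start
print(method2(23, 5, n=4))   # always exactly 4 records
```

With N = 23 and k = 5, Method 1 returns 5 records when the start is 1, 2 or 3 and only 4 records when the start is 4 or 5, which is the varying sample size mentioned above. Method 2 always returns n records but cannot be run without knowing N.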

3.6.4 Analyzing a systematic sample

Most surveys casually assume that the population has been sorted in random order when the systematic sample was selected and so treat the results as if they had come from a SRSWOR. This is theoretically not correct and if your assumption is false, the results may be biased, and there is no way of examining the biases from the data at hand.

Before implementing a systematic survey or analyzing a systematic survey, please consult with an expert in sampling theory to avoid problems. This is a case where an hour or two of consultation before spending lots of money could potentially turn a survey where nothing can be estimated, into a survey that has justifiable results.

3.6.5 Technical notes - Repeated systematic sampling

To avoid many of the potential problems with systematic sampling, a common device is to use repeated systematic samples on the same population.

For example, rather than taking a single systematic sample of size 100 from a population, you can take 4 systematic samples (with different starting points) of size 25.

An empirical method of obtaining a standard error from a systematic sample is to use repeated systematic sampling. Rather than choosing one systematic sample of every k-th unit, choose m independent systematic subsamples, each of size n/m. Then estimate the mean of each systematic subsample. Treat these means as a simple random sample from the population of possible systematic samples and use the usual sampling theory. The variation of the estimates among the systematic subsamples provides an estimate of the standard error (after an appropriate adjustment). This will be illustrated in an example.
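A minimal Python sketch of this idea follows. Several things here are assumptions rather than part of the notes: the population is simulated (a trend plus noise), the function names are hypothetical, the subsample size is assumed to divide N exactly, and the "appropriate adjustment" is taken to be the usual finite population correction over the number of possible systematic subsamples.

```python
import math
import random
import statistics

def replicated_systematic(values, m, per_rep, rng=random):
    """Draw m systematic subsamples of size per_rep with distinct random starts.
    Assumes per_rep divides len(values) exactly."""
    k = len(values) // per_rep           # spacing within one subsample
    starts = rng.sample(range(k), m)     # SRSWOR of m of the k possible starts
    return [values[s::k] for s in starts]

def estimate_mean(subsamples, n_possible):
    """Treat the subsample means as an SRSWOR of m of the n_possible subsamples."""
    means = [statistics.mean(s) for s in subsamples]
    m = len(means)
    est = statistics.mean(means)
    se = math.sqrt(statistics.variance(means) / m * (1 - m / n_possible))
    return est, se

# Hypothetical population of 1000 units with a linear trend plus noise.
rng = random.Random(2024)
population = [0.05 * i + rng.gauss(0, 5) for i in range(1000)]

# 4 systematic subsamples of size 25, as in the example above.
subs = replicated_systematic(population, m=4, per_rep=25, rng=rng)
est, se = estimate_mean(subs, n_possible=1000 // 25)
print(f"estimated mean {est:.2f}, se {se:.2f}")
```

Note that the standard error comes only from the variation among the 4 subsample means, so it rests on very few degrees of freedom. Using more, smaller subsamples gives a more stable estimate of the se.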

Example of replicated subsampling within a systematic sample

A yearly survey has been conducted in the Prairie Provinces to estimate the number of breeding pairs of ducks. One breeding area has been divided into approximately 1000 transects of a certain width, i.e. the breeding area was divided into 1000 strips.

What is the population of interest? As noted in class, the definition of a population depends, in part, upon the interest of the researcher. Two possible definitions are:

• The population is the set of individual ducks on the study area. However, no frame exists for the individual birds. But a frame can be constructed based on the 1000 strips that cover the study area. In this case, the design is a cluster sample, with the clusters being strips.

• The population consists of the 1000 strips that cover the study area and the number of ducks in each strip is the response variable. The design is then a simple random sample of the strips.

In either case, the analysis is exactly the same and the final estimates are exactly the same.

Approximately 100 of the transects are flown by an aircraft and spotters on the aircraft count the number of breeding pairs visible from the aircraft.

For administrative convenience, it is easier to conduct systematic sampling. However, there is structure to the data; it is well known that ducks do not spread themselves randomly throughout the breeding area.

After discussions with our Statistical Consulting Service, the researchers flew 10 sets of replicated systematic samples; each set consisted of 10 transects. As each transect is flown, the scientists also classify each transect as ‘prime’ or ‘non-prime’ breeding habitat.

Here is the raw data reporting the number of nests in each set of 10 transects:

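| Set | Prime habitat total (b) | Prime habitat n | Non-prime habitat total | Non-prime habitat n | All total (a) | Prime mean (c) | Non-prime mean (d) | Diff (e) |
|---|---|---|---|---|---|---|---|---|
| 1 | 123 | 3 | 345 | 7 | 468 | 41.0 | 49.3 | -8.3 |
| 2 | 57 | 2 | 36 | 8 | 93 | 28.5 | 4.5 | 24.0 |
| 3 | 85 | 5 | 46 | 5 | 131 | 17.0 | 9.2 | 7.8 |
| 4 | 97 | 2 | 131 | 8 | 228 | 48.5 | 16.4 | 32.1 |
| 5 | 34 | 5 | 43 | 5 | 77 | 6.8 | 8.6 | -1.8 |
| 6 | 85 | 3 | 67 | 7 | 152 | 28.3 | 9.6 | 18.8 |
| 7 | 56 | 7 | 64 | 3 | 120 | 8.0 | 21.3 | -13.3 |
| 8 | 46 | 2 | 65 | 8 | 111 | 23.0 | 8.1 | 14.9 |
| 9 | 37 | 4 | 43 | 6 | 80 | 9.3 | 7.2 | 2.1 |
| 10 | 93 | 2 | 104 | 8 | 197 | 46.5 | 13.0 | 33.5 |
| Avg | 71.3 | | | | 165.7 | | | 10.97 |
| s | 29.5 | | | | 117.0 | | | 16.38 |
| n | 10 | | | | 10 | | | 10 |
| Est total | 7130 | | | | 16570 | | | |
| Est mean | | | | | | | | 10.97 |
| Est se | 885 | | | | 3510 | | | 4.91 |

In columns (a) and (b) the "Est se" row is the se of the estimated total; in column (e) it is the se of the estimated mean difference.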

Several different estimates can be formed.

1. Total number of nests in the breeding area (refer to column (a) above). The total number of nests in the breeding area for all types of habitat is of interest. Column (a) in the above table is the data that will be used. It represents the total number of nests in the 10 transects of each set.

The principle behind the estimator is that the 1000 total transects can be divided into 100 sets of 10 transects, of which a random sample of size 10 was chosen. The sampling unit is the set of transects – the individual transects are essentially ignored.

Note that this method assumes that the systematic samples are all of the same size. If the systematic samples had been of different sizes (e.g. some sets had 15 transects, other sets had 5 transects), then a ratio-estimator (see later sections) would have been a better estimator.

• compute the total number of nests for each set. This is found in column (a).

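• Then the sets selected are treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as: $\hat{\mu} = (468 + 93 + \cdots + 197)/10 = 165.7$ with an estimated se of $se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{117.0^2}{10}\left(1 - \frac{10}{100}\right)} = 35.1$.

• because there are 100 sets of transects in total, the estimate of the population total number of nests and its estimated se are $\hat{\tau} = 100\hat{\mu} = 16570$ with $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 3510$, as reported in the table.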

2. Total number of nests in the prime habitat only (refer to column (b) above). This is formed in exactly the same way as the previous estimate. This is technically known as estimation in a domain.

The number of elements in the domain in the whole population (i.e. how many of the 1000 transects are in prime-habitat) is unknown but is not needed. All that you need is the total number of nests in prime habitat in each set – you essentially ignore the non-prime habitat transects within each set.

The average number of nests per set in prime habitats is found as before: $\hat{\mu} = \frac{123 + \cdots + 93}{10} = 71.3$ with an estimated se of $se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{29.5^2}{10}\left(1 - \frac{10}{100}\right)} = 8.85$.

• because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se are $\hat{\tau} = 100\hat{\mu} = 7130$ with $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 885$.

• Note that the total number of transects of prime habitat is not known for the population and so an estimate of the density of nests in prime habitat cannot be computed from this estimated total.

However, a ratio-estimator (see later in the notes) could be used to estimate the density.

3. Difference in mean density between prime and non-prime habitats (refer to columns (c)-(e) above). The scientists suspect that the density of nests is higher in prime habitat than in non-prime habitat. Is there evidence of this in the data? Here everything must be transformed to the density of nests per transect (assuming that the transects were all the same size). Also, pairing (refer to the section on experimental design) is taking place, so a difference must be computed for each set and the differences analyzed, rather than trying to treat the prime and non-prime habitats as independent samples.

Again, this is an example of what is known as domain-estimation.

• Compute the domain means for each type of habitat for each set (columns (c) and (d)). Note that the totals are divided by the number of transects of each type in each set.

• Compute the difference in the means for each set (column (e))

• Treat these differences as a simple random sample of size 10 taken from the 100 possible sets of transects. What do the final estimated mean difference and its se imply?
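The three estimates can be checked with a short Python sketch that uses only the counts in the table. The helper name `srs_mean_se` and the variable names are illustrative. Each quantity is treated as an SRSWOR of 10 sets from the 100 possible sets, with the same se formula as above.

```python
import math
import statistics

N_SETS = 100  # possible sets of 10 transects among the 1000 transects

# (prime total, prime n, non-prime total, non-prime n) for each sampled set
data = [
    (123, 3, 345, 7), (57, 2, 36, 8), (85, 5, 46, 5), (97, 2, 131, 8),
    (34, 5, 43, 5), (85, 3, 67, 7), (56, 7, 64, 3), (46, 2, 65, 8),
    (37, 4, 43, 6), (93, 2, 104, 8),
]

def srs_mean_se(y, N):
    """Mean and se of the mean for an SRSWOR of len(y) units out of N."""
    n = len(y)
    mean = statistics.mean(y)
    se = math.sqrt(statistics.variance(y) / n * (1 - n / N))
    return mean, se

all_totals = [p + q for p, _, q, _ in data]          # column (a)
prime_totals = [p for p, _, _, _ in data]            # column (b)
diffs = [p / pn - q / qn for p, pn, q, qn in data]   # column (e) = (c) - (d)

mean_all, se_all = srs_mean_se(all_totals, N_SETS)
mean_prime, se_prime = srs_mean_se(prime_totals, N_SETS)
mean_diff, se_diff = srs_mean_se(diffs, N_SETS)

print(f"all habitat:   total {N_SETS * mean_all:.0f}, se {N_SETS * se_all:.0f}")
print(f"prime habitat: total {N_SETS * mean_prime:.0f}, se {N_SETS * se_prime:.0f}")
print(f"difference:    mean {mean_diff:.2f}, se {se_diff:.2f}")
```

Up to rounding, this reproduces the summary rows of the table. The estimated mean difference is roughly 2.2 times its se, which you could compare with a t-distribution on 9 degrees of freedom when thinking about the question posed above.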
