
Probability sampling

In document ST104a Vle (Page 148-152)

9.7 Types of sample

9.7.2 Probability sampling

Probability sampling means that every population unit has a known, non-zero (though not necessarily equal) probability of being selected in the sample. In all cases selection is performed through some form of randomisation, for example using a pseudo-random number generator.

Relative to non-probability methods, it can be expensive and time-consuming, and it also requires a sampling frame. We aim to minimise both the (random) sampling error and the systematic sampling bias. Since the probability of selection is known, standard errors can be computed, which allows hypothesis tests to be performed and confidence intervals to be constructed.

We shall consider five types of probability sampling:

Simple random sampling (SRS): a special case where each population unit has a known, equal, non-zero probability of selection. It is the simplest of the probability sampling methods and produces unbiased estimates, although more accurate methods (i.e. those with smaller standard errors) exist.
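As a minimal sketch of SRS, the following draws n units without replacement from a sampling frame using Python's standard pseudo-random number generator. The frame of 100 unit IDs, the function name and the seed are illustrative assumptions, not part of the original text.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; each unit has inclusion probability n/N."""
    rng = random.Random(seed)  # seeded only so the draw can be reproduced
    return rng.sample(frame, n)

# Hypothetical sampling frame: unit IDs 1..100 (an assumption for illustration).
frame = list(range(1, 101))
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```

Here every unit has the same selection probability, 10/100, which is what distinguishes SRS from the other designs below.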

Systematic random sampling: a 1-in-x systematic random sample is obtained by randomly selecting one of the first x units in the sampling frame and then selecting every subsequent x-th unit. Though easy to implement, it is important to consider how the sampling frame is compiled. For example, the data may exhibit a periodic property, such as daily sales in a shop. All Monday sales are likely to be similar, as are all Saturday sales, but Saturdays are traditionally much busier shopping days. So daily sales show a large variation between days of the week, but a small variation within any given day of the week. A 1-in-7 systematic sample would select the same day of the week every time, which would lead us to underestimate the true variation in sales. A 1-in-8 systematic sample would be better. Can you say why?
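One way to explore the periodicity problem is with a small sketch. The weekly sales figures below are invented for illustration and repeat exactly each week (no noise), which is a deliberately extreme assumption. The function name is also hypothetical.

```python
import random
import statistics

def systematic_sample(frame, x, seed=None):
    """1-in-x systematic sample: random start among the first x units, then every x-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(x)  # 0-based index of the randomly chosen starting unit
    return frame[start::x]

# Hypothetical daily sales, Monday to Sunday, repeated for 52 weeks (invented figures).
weekly_pattern = [100, 110, 105, 115, 130, 200, 90]
sales = weekly_pattern * 52

one_in_7 = systematic_sample(sales, 7, seed=0)
one_in_8 = systematic_sample(sales, 8, seed=0)

print("population sd:", statistics.pstdev(sales))
print("1-in-7 sd:    ", statistics.pstdev(one_in_7))
print("1-in-8 sd:    ", statistics.pstdev(one_in_8))
print("distinct weekdays hit by 1-in-8:", len(set(one_in_8)))
```

Comparing the three standard deviations, and counting how many different days of the week each sample contains, should suggest the answer to the question above.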

Stratified random sampling: this sampling technique achieves a higher level of accuracy (lower standard errors) by exploiting natural groupings within the population. Such groupings are called strata (‘strata’ is the plural form, ‘stratum’ is the singular form) and these are characterised as having population units which are similar within strata, but different between strata. (More formally, elements within a stratum are homogeneous while the strata are collectively heterogeneous.)

A simple random sample is taken from each stratum, thus ensuring a representative overall sample since a good cross section of the population will have been selected, provided suitable strata have been created. Hence great care needs to be taken when choosing the appropriate stratification factors (such as age, gender, etc.), which should be relevant to the purpose of the sample survey.

Imagine we were investigating student satisfaction levels at a university. If we took a simple random sample of students we could, by chance, select just (or mainly) first-year students whose opinions may very well differ from those in other year groups. So in this case a sensible stratification factor would be ‘year of study’. By taking a simple random sample from each stratum you ensure that you will not end up with an extreme sample and also avoid the possibility of one particular group not being represented at all in the sample. Of course, in order to perform stratified random sampling, we would need to be able to allocate each population unit to a stratum. For students this should be straightforward since the university’s database of student names would no doubt include year of study too.
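The student example can be sketched as follows. The student records, the function name and the choice of an equal sample size in every stratum are all assumptions made for illustration; the text does not specify how the sample should be allocated across strata.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum units from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)  # allocate each unit to its stratum
    return {
        label: rng.sample(members, min(n_per_stratum, len(members)))
        for label, members in sorted(strata.items())
    }

# Hypothetical student database: (student_id, year_of_study) with years 1, 2, 3.
students = [(i, 1 + i % 3) for i in range(300)]
by_year = stratified_sample(students, lambda s: s[1], 5, seed=42)

for year, chosen in by_year.items():
    print(year, chosen)
```

Because a sample is drawn within every year group, no year of study can be missing from the overall sample.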

Cluster sampling: this is used to reduce costs. Here the population is divided into clusters, ideally such that each cluster is as variable as the overall population (i.e. heterogeneous clusters). Next, some of the clusters are selected by SRS. This is where the cost savings are expected, since typically the clusters may be constructed on a geographical basis, thus allowing interviewers to restrict themselves to visiting households in certain areas rather than having to travel long distances to meet the respondents required by a simple random sample, who potentially cover a large area. A one-stage cluster sample would mean that every unit in the chosen clusters is surveyed.

It may be that the cluster sizes are large, meaning that it is not feasible to survey every unit within a cluster. A two-stage cluster sample involves taking a simple random sample from each of the clusters selected by SRS. When a subsample is taken from a selected cluster, this is known as a multistage design.
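The difference between one-stage and two-stage cluster sampling can be sketched as below. The ten geographical areas of twenty households each, and the function and parameter names, are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=None):
    """Select clusters by SRS; survey all units (one-stage) or an SRS of units (two-stage)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # first stage: SRS of clusters
    result = {}
    for name in chosen:
        units = clusters[name]
        if n_per_cluster is None:
            result[name] = list(units)  # one-stage: every unit in the cluster
        else:
            # second stage: SRS of units within the selected cluster
            result[name] = rng.sample(units, min(n_per_cluster, len(units)))
    return result

# Hypothetical frame: 10 areas, each containing 20 households.
clusters = {
    f"area_{k}": [f"household_{k}_{j}" for j in range(20)] for k in range(10)
}
one_stage = cluster_sample(clusters, 3, seed=7)
two_stage = cluster_sample(clusters, 3, n_per_cluster=5, seed=7)

print({name: len(units) for name, units in one_stage.items()})
print({name: len(units) for name, units in two_stage.items()})
```

Note that only the selected clusters need a list of their units, which is why this design can work without a frame for the whole population.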

Individual respondents will be identified in a random manner, but crucially within an area. You will reduce costs (the interviewer will be able to complete a higher number of interviews in a given time, and use less petrol and shoe leather), but will probably have to sacrifice a degree of accuracy, that is, the method is less efficient in general. However, cluster sampling is very useful when a sampling frame is not immediately available for the entire population. For example, individual universities will have databases of their own students, but a national student database does not exist (admittedly it may be possible to create, but it would take time and cost money).

Clustering is clearly useful in an interviewer-administered survey. It is less important as a design feature for telephone or postal interviews unless you are particularly interested in the cluster itself. To the extent that individuals in a cluster are similar (having intra-class correlation), they will be less representative of other clusters. In other words, the variance (hence standard error) of your sample estimate will be greater.
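The text does not give a formula for this loss of precision. A commonly used approximation, often attributed to Kish and valid for clusters of equal size, is the design effect 1 + (m - 1)ρ, where m is the number of units sampled per cluster and ρ is the intra-class correlation. The sketch below applies it with invented values of m and ρ, so it should be read as an illustration under those assumptions only.

```python
import math

def design_effect(units_per_cluster, icc):
    """Approximate variance inflation relative to SRS, assuming equal-sized clusters."""
    return 1 + (units_per_cluster - 1) * icc

# Hypothetical values: 20 interviews per cluster, intra-class correlation 0.05.
deff = design_effect(20, 0.05)
print("variance inflated by a factor of about", round(deff, 2))
print("standard error inflated by a factor of about", round(math.sqrt(deff), 2))
```

With no intra-class correlation (ρ = 0) the factor is 1, so clustering would cost nothing in precision; the more alike the units within a cluster, the larger the penalty.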

A further reason, in addition to cost, for cluster sampling, may arise from your need as a researcher to look at the clusters themselves for their own sake. An interest in income and educational levels for a group living in one area, or a study of the children at a particular school and their reaction to a new television programme, will require you to look at individuals in a cluster.

Multistage sampling: this refers to the case when sample selection occurs at two or more successive stages (the two-stage cluster sample above is an example). Multistage sampling is frequently used in large surveys. During the first stage, large compound units are sampled (primary units). During the second stage, smaller units (secondary units) are sampled from the primary units. From here, additional sampling stages of this type may be performed as required until we finally sample the basic units.

As we have already seen, this technique is often used in cluster sampling so that we initially sample main clusters, then clusters within clusters, etc. We can also use multistage sampling with a mixture of techniques. A large government survey will likely incorporate elements of both stratification and clustering at different stages. A typical multistage sample in the UK might involve the following:

• Divide the areas of the country into strata by industrial region.

• Sample clusters (local areas) from each industrial region.

• From each local area choose some areas for which you have lists (say electoral register, or postcode) and take a simple random sample from the chosen lists (clusters).
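The steps above can be sketched, in simplified form, as stratification by region followed by two stages of random selection. The nested dictionary standing in for the country, the sample sizes and the function name are all hypothetical. The intermediate step of choosing areas for which lists exist is collapsed here into a single list of addresses per local area.

```python
import random

def multistage_sample(country, areas_per_region, units_per_area, seed=None):
    """Stratify by region, sample local areas in each region, then SRS within each area."""
    rng = random.Random(seed)
    sample = {}
    for region in sorted(country):  # every stratum (region) is included
        areas = country[region]
        chosen = rng.sample(sorted(areas), areas_per_region)  # stage 1: clusters
        for area in chosen:
            # stage 2: SRS from the list available for this local area
            sample[(region, area)] = rng.sample(areas[area], units_per_area)
    return sample

# Hypothetical country: 4 regions, 8 local areas per region, 30 addresses per area.
country = {
    f"region_{r}": {
        f"area_{r}_{a}": [f"address_{r}_{a}_{u}" for u in range(30)]
        for a in range(8)
    }
    for r in range(4)
}
selected = multistage_sample(country, areas_per_region=2, units_per_area=5, seed=3)

for (region, area), addresses in selected.items():
    print(region, area, addresses)
```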

Note that from a technical perspective, stratified sampling can be thought of as an extreme form of two-stage cluster sampling where at the first stage all clusters in the population are selected. In addition, one-stage cluster sampling is at the opposite end of this spectrum:

Stratified sampling: all strata, some units in each stratum.

One-stage cluster sampling: some clusters, all units in each selected cluster.

Two-stage cluster sampling: some clusters, some units in each selected cluster.

Activity

You have been asked to make a sample survey of each of the following. Would you use random or quota sampling? Explain.

Airline pilots, for their company, about their use of holiday entitlement in order to bring in a new work scheme.

In this case as the survey is for the company (and there is therefore no confidentiality issue) it is quite easy to use the company’s list of personnel. A quota sample would not be very easy in these circumstances: you would have to send your interviewers to a venue where most pilots would be likely to meet, or you would risk a very unrepresentative sample.

So in this case a random sample would be easy and efficient to use. You would be able to collect accurate information and use your statistical techniques on it. The subject matter, too, means that it is likely the pilots would take the survey more seriously if they were contacted through the company’s list.

Possible tourists, about their holiday destinations and the likely length of time and money they expect to spend on holiday in the next year, for a holiday company planning its holiday schedule and brochure for next year.

The situation for the tourist survey is different from that for airline pilots. There will be not one, but several, lists of tourists from different holiday companies, and data confidentiality might well mean you could not buy lists which do not belong to your company. You might use the register of voters or list of households, but then you would not necessarily target those thinking about holidays in the near future. So a random sample sounds like an expensive option if this is to be a general study for a tourist company assessing its future offers and illustrations for its holiday brochure. Here, a quota sample makes more sense: interviewers can quickly find the right respondent for the company’s needs and get a general picture of holiday-makers’ preferences.

Household expenditure for government assessment of the effect of different types of taxes.

The government survey will require accuracy of information in an important policy area. A random sample will have to be used, drawn from the national lists of addresses or voters.

9.8 Types of error

We can distinguish between two types of error in sampling design (these should not be confused with Type I and Type II errors, which only concern hypothesis testing):

Sampling error: This occurs as a result of us selecting a sample, rather than performing a census (where a total enumeration of the population is undertaken). It is attributable to random variation due to the sampling scheme used. For probability sampling, we can estimate the statistical properties of the sampling error, i.e. we can compute (estimated) standard errors which facilitate the use of hypothesis testing and construction of confidence intervals.
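As a rough illustration of estimating sampling error, the sketch below computes the estimated standard error of a sample mean and an approximate 95% confidence interval for a simple random sample. The simulated population, the sample size and the use of the normal multiplier 1.96 are assumptions, and no finite population correction is applied.

```python
import math
import random
import statistics

def mean_with_ci(sample, z=1.96):
    """Sample mean, estimated standard error, and an approximate 95% confidence interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error of the mean
    return mean, se, (mean - z * se, mean + z * se)

# Hypothetical population of 10,000 values (invented for illustration).
rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(10_000)]
srs = rng.sample(population, 100)

mean, se, (lower, upper) = mean_with_ci(srs)
print(f"mean = {mean:.2f}, se = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

This calculation is only justified because the selection probabilities are known; it says nothing about non-sampling error.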

Non-sampling error: This occurs as a result of (inevitable) failures of the sampling scheme. In practice it is very difficult to quantify this sort of error; where this is attempted, it is typically done through separate investigation. We distinguish between two sorts of non-sampling error:

Selection bias: this may be due to (i) the sampling frame not being equal to the target population, or (ii) the sampling frame not being strictly adhered to, or (iii) non-response bias.

Response bias: the actual measurements might be wrong due to, for example, ambiguous question wording, misunderstanding of a word in a questionnaire by less educated people, or sensitivity of information which is sought. Interviewer bias is another aspect of this, where the interaction between the interviewer and interviewee influences the response given in some way, either intentionally or unintentionally, such as through leading questions, the dislike of a particular social group by the interviewer, the interviewer’s manner or lack of training, or perhaps the loss of a batch of questionnaires from one local post office. These could all occur in an unplanned way and bias your survey badly.
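The first source of selection bias listed above, a sampling frame that does not match the target population, can be illustrated with a small simulation. All figures are invented: 8,000 units appear on the frame and 2,000 do not, and the unlisted units are assumed to have systematically lower values.

```python
import random
import statistics

def frame_bias_demo(seed=0, n=500):
    """Compare the target population mean with the mean of an SRS drawn only from the frame."""
    rng = random.Random(seed)
    listed = [rng.gauss(50, 10) for _ in range(8000)]    # units on the sampling frame
    unlisted = [rng.gauss(30, 10) for _ in range(2000)]  # units missing from the frame
    population_mean = statistics.fmean(listed + unlisted)
    sample = rng.sample(listed, n)  # a genuine SRS, but only of the listed units
    return population_mean, statistics.fmean(sample)

population_mean, sample_mean = frame_bias_demo()
print(f"target population mean: {population_mean:.1f}")
print(f"mean of SRS from frame: {sample_mean:.1f}")
```

Increasing n here would shrink the sampling error but not the gap caused by the incomplete frame, which is why this kind of error cannot be read off from a standard error.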
