The target population: estimation and inaccuracy

Part 1: Sampling Errors

5.4 The target population: estimation and inaccuracy

5.4.1 Estimation procedures and information needed

As stated several times, the basic variables of the target population, such as SIC code, have the same reference time as the statistics. For example, both annual and short-term statistics referring to year t should be based on the delineation of units and the SIC codes of that year. The frame, however, is based on a BR extracted too early to achieve this.

There are several possibilities at the estimation stage, with different ambitions for updating the information and, at the same time, with different results as to accuracy with respect to frame errors (coverage deficiencies). Whatever the procedure chosen, the resulting (in)accuracy needs to be measured.

A typical situation is a design with stratification by industry and size. A random sample is drawn in each stratum. The strata containing the largest units have a selection probability equal to one. The stratification into sets of SIC codes corresponds to the domains, each stratum being equal to (or more detailed than) a domain. Size is used in the stratification to improve accuracy. The basic estimator of the total value of production, say, for a particular industry is then simply a sum over the size groups for that industry. The variance of the estimator is also computed by summing over these strata. The estimation procedure can use a Horvitz-Thompson estimator, expanding sample values by the inverse of the selection probabilities (in the case of full response); see further Chapter 2. This holds for the sampling unit and its domain as given by the frame. With a different observation unit, contributions to a particular industry will also come from other strata, for example if enterprises are sampled and their kind-of-activity units are the observation units.
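The stratified estimator just described can be sketched as follows. This is a minimal illustration, assuming simple random sampling without replacement within each stratum and full response; the function names and all figures are hypothetical and not taken from the report.

```python
from statistics import variance


def stratum_estimate(N_h, sample_values):
    """Expansion estimate of a stratum total and its estimated variance.

    Assumes simple random sampling without replacement within the stratum
    and full response (illustrative assumptions).
    """
    n_h = len(sample_values)
    if n_h == 0 or n_h > N_h:
        raise ValueError("need 1 <= n_h <= N_h")
    # Each sampled unit is expanded by the inverse selection probability N_h / n_h.
    total = (N_h / n_h) * sum(sample_values)
    if n_h == N_h:
        return total, 0.0  # take-all stratum: no sampling variance
    if n_h < 2:
        raise ValueError("variance needs at least two sampled units")
    s2 = variance(sample_values)  # sample variance with divisor n_h - 1
    return total, N_h ** 2 * (1 - n_h / N_h) * s2 / n_h


def industry_estimate(strata):
    """Sum stratum totals and stratum variances over the size groups."""
    parts = [stratum_estimate(N_h, ys) for N_h, ys in strata]
    return sum(t for t, _ in parts), sum(v for _, v in parts)


# Hypothetical size strata of one industry: (frame count N_h, sampled values).
size_strata = [
    (50, [10, 12, 14, 16]),            # small units, sampled
    (20, [100, 120, 140, 160, 180]),   # medium units, sampled
    (3, [1000, 1500, 2500]),           # largest units, selection probability one
]
total, var = industry_estimate(size_strata)
print(total, var ** 0.5)  # point estimate and standard error
```

The take-all stratum contributes to the point estimate but not to the variance, which is why changes to such units do not affect the sampling error.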

There are further possible estimators, depending on what information is available in addition to that in the frame. There are two main reasons to use further information:

• to reduce bias by including corrections and updates;

• to reduce variance through utilising auxiliary information.

The amount of further information may vary: it can be limited to the sample or it can be available for the population, for example in terms of further variables or an updated BR.

Some situations are described below in Sections 5.4.2-5.4.5. For estimation procedures, see Chapters 2-3 or the literature; for calibrated weights in generalised regression estimators, for example, see Deville & Särndal (1992).

5.4.2 Using the frame population only

The simplest estimation procedure is to keep to the frame population, that is, each unit keeps its domain of estimation as on the frame. As described above, each pair of point estimate and standard error is computed by summing over the corresponding strata.

This procedure can be applied not only to classification but also to units that are in fact dead or otherwise outside the target population, by treating them like nonresponse. If there is no renewal of the sample, such an estimation procedure can be regarded as including a model assumption on the relationship between under- and over-coverage: that they are equal in size. Unless the assumption is true, there is bias due to under- and over-coverage for the population as a whole and for each domain. When the birth rate is high compared to the death rate, there is under-estimation, and vice versa.

Care needs to be taken in using simplified assumptions. Investment provides a particular challenge. New units and units that are growing are likely to be strong investors. Conversely, units that are struggling and, as a result, diminishing in size will have little opportunity to buy new assets. Elvers (1993) discusses this for a survey based on a cut-off sample restricted to units with 20 employees or more.

An alternative, which departs from the frame information to some extent, is to identify the over-coverage and set the variable values to zero for these units. If there is no renewal of the sample, there is then an imbalance, since over-coverage but not under-coverage is taken into account.
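The two treatments of dead units can be contrasted in a small sketch. It assumes a single stratum with simple random sampling, and it interprets "treating them like nonresponse" as reweighting the active respondents to the frame count, which is one common adjustment and an assumption here; the names and figures are hypothetical.

```python
def total_dead_as_nonresponse(N, active_values):
    """Dead units treated like nonresponse: the active units are reweighted
    to represent the whole frame stratum of N units."""
    return N * sum(active_values) / len(active_values)


def total_dead_as_zero(N, active_values, n_dead):
    """Dead units kept in the sample with nil values: over-coverage is
    removed, but nothing is added for under-coverage."""
    n = len(active_values) + n_dead
    return N * sum(active_values) / n


N = 100                 # units in the frame stratum
active = [10.0] * 8     # eight sampled units found active
n_dead = 2              # two sampled units found dead

keep_frame = total_dead_as_nonresponse(N, active)
nil_values = total_dead_as_zero(N, active, n_dead)
print(keep_frame, nil_values)
```

The gap between the two figures is the estimated over-coverage. Which figure is closer to the target total depends on how much under-coverage the frame carries, which neither procedure measures.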

Illustrations: Tables 5.3 and 5.4 in Section 5.7.3 show an example of over- and under-coverage with a cut-off survey. The bias due to an old SIC code is shown for an example in Figure 5.1 in Section 5.7.2.

5.4.3 Updating the sample only

If the units in the sample have their domain “checked” in the survey, interior movements and corrections can be taken into account by assigning each sample unit to its proper domain of estimation. This implies that the bias from this error source is eliminated. There is, however, an increased variance due to including this information – which may be a rare characteristic – based on sample information only. Chapter 3 provides formulas in its sections on domain estimation, for example a simple case in Section 3.1.2.

There may, in fact, be quite a difference in going from (i) the variance coming from a small set of “tailor-made” strata as indicated in Sections 5.4.1-5.4.2, to (ii) the variance derived from these strata and some further strata where a few units with actual values contribute to the variance together with a large number of nil values. This is a consequence of frame deficiency.
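The variance effect can be illustrated with a domain estimator that uses the classification observed in the survey. This is a sketch under assumed conditions: one frame stratum, simple random sampling without replacement, and hypothetical data in which one sampled unit out of ten turns out to belong to another domain.

```python
from statistics import variance


def domain_estimate(N, sample, domain):
    """Estimate a domain total from a stratum sample.

    sample is a list of (observed_domain, value) pairs. Units outside the
    domain enter with a nil value, which is what inflates the variance when
    the domain is rare in the stratum.
    """
    n = len(sample)
    z = [y if d == domain else 0.0 for d, y in sample]
    total = N / n * sum(z)
    var = N ** 2 * (1 - n / N) * variance(z) / n
    return total, var


N = 100
# Nine units confirmed in domain "A", one found to belong to domain "B".
sample = [("A", 40.0)] * 9 + [("B", 50.0)]

for dom in ("A", "B"):
    total, var = domain_estimate(N, sample, dom)
    print(dom, total, var ** 0.5, var ** 0.5 / total)  # total, s.e., relative s.e.

total_b, var_b = domain_estimate(N, sample, "B")
total_a, var_a = domain_estimate(N, sample, "A")
```

In this constructed example the relative standard error for the rare domain is many times larger than for the main domain, even though both estimates come from the same sample.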

There are also exterior movements and corrections: units leaving and entering the population. Updating for units that have left means, for example, identifying over-coverage and giving it a nil value. There is then an asymmetry if no action is taken for the under-coverage, as stated in Section 5.4.2. Either additional sampling or model assumptions are needed to estimate for units not in the population originally sampled, the frame population. A very simple model is to assume equal effects of over- and under-coverage, but this assumption is likely to be realistic only when the economy is stable, and not always even then.

For units included with probability one, changes can be made without affecting the variance; for example, reorganisations can be taken into account and classification updates can be made, as long as each such unit represents only itself. However, care must be taken if surveys of different sectors are run independently. For example, if such a unit is reclassified from retailing to manufacturing, it could be removed from the retailing survey. A second action needs to be taken at the same time to ensure that it is included in the manufacturing survey. There may be difficulties in doing this in practice.

Illustrations: The increase in variance (or rather its square root) when updating an old SIC code based on sample information is shown for an example in Figure 5.1 in Section 5.7.2.

5.4.4 Utilising later BR information on the population

A situation with even more information is where there is a further variable for all units, not used in the design, or where there is renewed information on the original design variables. One estimation method is so-called poststratification, where a stratification variable is added at the estimation stage. The calibration technique is an example of including such auxiliary information (possibly quantitative) to improve the estimation. This may lead to a reduction of both bias and variance. It is a model-assisted estimation method that is used for the surveyed part of the population.
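Poststratification can be sketched as follows. The example assumes that units within a poststratum carry equal design weights (for instance a simple random sample) and that the population counts per poststratum are known from a later BR; the class labels and figures are hypothetical.

```python
def poststratified_total(pop_counts, sample):
    """Poststratified estimate of a total.

    pop_counts: {poststratum: known population count N_g}
    sample:     list of (poststratum, value) pairs, the poststratum being
                the updated classification of the sampled unit
    Assumes equal design weights within each poststratum.
    """
    sums, counts = {}, {}
    for g, y in sample:
        sums[g] = sums.get(g, 0.0) + y
        counts[g] = counts.get(g, 0) + 1
    total = 0.0
    for g, N_g in pop_counts.items():
        if g not in counts:
            raise ValueError(f"no sampled units in poststratum {g!r}")
        total += N_g * sums[g] / counts[g]  # N_g times the sample mean in g
    return total


pop_counts = {"small": 80, "large": 20}  # counts from the later BR
sample = [("small", 10.0), ("small", 12.0), ("small", 8.0),
          ("large", 100.0), ("large", 120.0)]

post = poststratified_total(pop_counts, sample)
plain = sum(pop_counts.values()) / len(sample) * sum(y for _, y in sample)
print(post, plain)
```

The plain expansion estimate is pulled upwards because large units happen to be over-represented in this sample; the poststratified estimate corrects for that by using the known counts.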

Movements of units into the population are not included in the procedures just mentioned. They require model-based procedures with assumptions about these units. Again, there is an asymmetry to be overcome.

There are illustrations of changes in SIC code and number of employees from one year to the next in Table 5.2 and Figure 5.2, respectively. Table 5.1 has SIC code for a shorter period.

5.4.5 Utilising a BR covering the reference period

The technique of constructing a BR covering a period was described above in Section 5.3.7. The target population is here considered fully known. This is, of course, a simplification, since some errors will remain. This BR is, however, a considerable improvement over the version at the time of frame construction. From the estimation point of view, the situation with this BR covering the reference period is roughly the same as that in Section 5.4.4 in terms of methods and assumptions. This means for example that poststratification and calibration methods are available for interior movements.

Movements out of the population are identified, that is, the over-coverage is known. The under-coverage is also identified. The estimation has to be model-based for those units (unless there is time for further questionnaires), using for example similar units in the surveyed part of the population and/or administrative data. Again, the reasoning is based on this late BR covering a period showing the truth; in practice there are, of course, remaining deficiencies.
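One very simple model-based treatment of the identified under-coverage is to give each missing unit the mean of similar surveyed units. The sketch below assumes that "similar" means the same size class and that new units behave like surveyed ones; that assumption may fail for some variables, such as investment. The names and figures are hypothetical.

```python
def undercoverage_contribution(new_unit_counts, surveyed_values):
    """Model-based contribution of units missing from the frame.

    new_unit_counts: {class: number of units found only in the late BR}
    surveyed_values: {class: values reported by surveyed units of that class}
    Each missing unit is assigned the mean of surveyed units in its class.
    """
    contribution = 0.0
    for cls, count in new_unit_counts.items():
        values = surveyed_values[cls]
        contribution += count * sum(values) / len(values)
    return contribution


new_units = {"small": 5, "medium": 2}
surveyed = {"small": [10.0, 12.0, 8.0], "medium": [100.0, 140.0]}

extra = undercoverage_contribution(new_units, surveyed)
print(extra)  # amount to add to the estimate for the surveyed part
```

Administrative data for the missing units, where available, could replace the class means in the same structure.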

In Section 5.7.3, Table 5.3-Table 5.4 illustrate over- and under-coverage with a cut-off survey, and there is information on the “extra” units provided by the BR covering the calendar year.

5.4.6 Some comments on the BR and effects of coverage deficiencies

Discussions on the topic of quality of a BR are going on at the EU level (Eurostat 1998a). The connections between Business Registers and the statistics using them are getting stronger. There is an increasing interest in business demography, and regular work on quality assessment of business registers is taking place at some statistical offices. See also Struijs & Willeboordse (1995), Archer (1995), and Griffiths & Linacre (1995), already mentioned, and illustrations below.

The measurement of inaccuracy caused by coverage deficiencies may be undertaken in three different ways:

1) Review updating procedures of the BR to look at time delays. This will provide a broad indicator only, but it is available at the time when the frame is constructed.

2) Compare units on an updated BR with the BR used. Counts can be made of the number of units erroneously included or excluded. Likewise the number of units classified to the wrong domain of estimation can be evaluated.

3) Compute approximately the level of inaccuracy. Estimates can be made for the frame population and for the estimated target population, using a variable that is available at the population level (for example turnover from VAT, or salaries and wages or number of employees from PAYE). Whilst this method provides the most information, it is also the most demanding and resource-intensive.
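The second and third methods can be sketched with two register extracts. The example assumes each register maps a unit identifier to its domain and an auxiliary value such as turnover; the data structure, names and figures are hypothetical.

```python
def coverage_counts(frame, updated):
    """Method 2: units erroneously included, excluded or misclassified."""
    over = set(frame) - set(updated)    # on the frame, not in the updated BR
    under = set(updated) - set(frame)   # in the updated BR, missing from the frame
    common = set(frame) & set(updated)
    misclassified = {u for u in common if frame[u][0] != updated[u][0]}
    return over, under, misclassified


def domain_totals(register):
    """Total of the auxiliary variable per domain."""
    totals = {}
    for domain, aux in register.values():
        totals[domain] = totals.get(domain, 0.0) + aux
    return totals


def relative_differences(frame, updated):
    """Method 3: frame total relative to the updated total, per domain."""
    f, t = domain_totals(frame), domain_totals(updated)
    return {d: (f.get(d, 0.0) - t[d]) / t[d] for d in t if t[d] != 0}


# unit id -> (domain, auxiliary value)
frame = {"u1": ("A", 100.0), "u2": ("A", 50.0),
         "u3": ("B", 80.0), "u4": ("B", 20.0)}
updated = {"u1": ("A", 100.0), "u2": ("B", 50.0),
           "u3": ("B", 80.0), "u5": ("A", 30.0)}

over, under, misclassified = coverage_counts(frame, updated)
rel = relative_differences(frame, updated)
print(over, under, misclassified, rel)
```

The counts show how many units are affected, while the relative differences indicate the effect on an estimate, which is the distinction drawn between the second and third methods.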

The illustrations in Section 5.5 are tied to the BR, and Sections 5.6-5.7 provide a range of illustrations for frames, although these are almost entirely restricted to the UK and Sweden. Most illustrations in Sections 5.6 and 5.7 belong to the first and second of the above methods. There are, however, a few examples of accuracy measures in Section 5.7 that belong to the third method. This is the preferable one, since a quality assessment should aim at the effects of frame errors (coverage deficiencies).