Part 1: Sampling Errors
5.3 Frame and target populations
5.3.1 Target population
As stated, the target parameters have the reference time for both units and variables equal to the current month/quarter/year. The target population could, for example, be all enterprises or all kind-of-activity units in the manufacturing industry which are active in the current period.
5.3.2 Frame, and frame population
Ideally there is a perfect frame which lists every unit in the target population once and only once together with basic design variables. In reality the frame is affected by various imperfections for several reasons, for example time delays and coding mistakes. For business statistics, like SBS and STS, the frame is normally based on a BR.
The frame population for a particular survey is based on the target population of that survey. It is normally expressed in the same way as the target population, that is, in terms of units, SIC codes, and possibly size; for example “all enterprises in the manufacturing industry”. It uses the information available in the BR, and it may put on restrictions, for example that the enterprises included are active when the frame is constructed.
An annual survey collects data after the reference year, and a short-term survey collects data during the year (shortly after each month/quarter). If the frame is constructed shortly before sending out the questionnaires, that time is at the end of the year for the annual survey, and shortly before the reference year for the short-term survey. The latter may take further samples during the year. In either case, the frame errors are different for these two sets of statistics, unless the annual statistics deliberately use the same frame as the short-term statistics for the sake of agreement; compare Chapter 10.
The frame population is based on the information that is available at that time. For short-term statistics regarding year t, the SIC codes refer to year (t−1) at best, more likely to year (t−2) or possibly even earlier, depending on the production time of the statistics used and the frequency of updating. In the case of the manufacturing industry this normally depends on when PRODCOM information becomes available.
Note: PRODCOM is short for the French words “Production communautaire” meaning Community production.
5.3.3 Differences between the frame population and the target population
There are two types of differences between the frame and target populations:
• differences for the population as a whole;
• differences within the population, affecting domains (sub-populations).
Another way of expressing this is in terms of how a unit, for example an enterprise, is classified: into surveys, or into domains within a survey. (This could be manufacturing versus service industries, and industries within the manufacturing industry, respectively.) Those two cases will be dealt with in Sections 5.3.4 and 5.3.5, respectively.
A part of the target population may deliberately be left out of the survey, for example enterprises below a certain size may be cut off. The estimation for this part of the population has to be based on model assumptions, see Chapters 4 and 9. Administrative data may be useful, especially if there are variables strongly related to those of the statistics.
A different classification of frame “errors” is with respect to the time it takes until they are corrected. Some are simply due to time delays in the information from different sources. Such errors can be evaluated after updates. Other errors are either detected in special circumstances – like a survey or a change including that information – or (more or less) never detected. Those errors can hardly be studied; at the least they require special investigations. Small units especially may be subject to an error for a long time.
The updating procedure may sometimes be held back deliberately, as mentioned above in Section 5.3.2 for coherence between short-term and annual statistics in some Member States. Another example is the practice, used in the UK, of keeping the same set of classifications and size measures for short-term statistics throughout the year, in order not to add the effects of re-classifications to the within-year changes. Both stratum and domain are “frozen”; see further Sections 5.6.1 and 5.7.1.
5.3.4 Under- and over-coverage of the population
There are two types of deviations between the frame population and the target population:
• under-coverage: units belonging to the target population but not to the frame population
• over-coverage: units belonging to the frame population but not to the target population
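The two definitions above amount to set differences between the frame and the target population. The following is a minimal sketch, assuming that both populations can be listed by unit identifier (in practice the target population is not fully known, which is why coverage deficiencies are hard to measure). The function name, the identifiers, and the rate definitions are illustrative assumptions, not part of the source.

```python
def coverage_deficiencies(frame_ids, target_ids):
    """Split the mismatch between a frame and a target population.

    Hypothetical helper: assumes both populations are available as
    collections of unit identifiers.
    """
    frame, target = set(frame_ids), set(target_ids)
    under = target - frame   # in the target population, missing from the frame
    over = frame - target    # in the frame, outside the target population
    return {
        "under_coverage": under,
        "over_coverage": over,
        # Illustrative rates: each deficiency relative to the population it affects.
        "under_rate": len(under) / len(target) if target else 0.0,
        "over_rate": len(over) / len(frame) if frame else 0.0,
    }


# Made-up identifiers, for illustration only.
frame_ids = ["A", "B", "C", "D"]
target_ids = ["B", "C", "D", "E", "F"]
result = coverage_deficiencies(frame_ids, target_ids)
print(sorted(result["under_coverage"]), sorted(result["over_coverage"]))
```

Here units E and F are under-coverage and unit A is over-coverage.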
There is an asymmetry between the two. A consequence of under-coverage is that observations are not collected for a part of the target population. This may imply a bias in the statistics. Over-coverage means that resources are used on uninteresting units. The over-coverage may be regarded as an “extra” domain of estimation, and one of the results (in comparison with no over-coverage) is an increase in uncertainty when estimating the “regular” domains. If the unit’s membership of the target population is not checked, there may be a bias.
For both under- and over-coverage, the resulting inaccuracy depends on the amount of the coverage deficiencies, the ability to detect them, and the counter-actions taken in the estimation procedure.
Furthermore, there may be practical difficulties in distinguishing over-coverage and unit nonresponse. A unit outside the target population that receives a questionnaire may be more or less inclined to return it than a unit belonging to the target – it is easy to return, but on the other hand there seems to be no reason to fill in the questionnaire. Some questionnaires may be returned by the postal authorities because the address is no longer valid – that should, of course, be followed up. See Chapter 8.
5.3.5 Differences within the population
The reasoning that was used in the previous section for the whole population is to some extent also valid for each sub-population. However, under-coverage of one domain is over-coverage for another.
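The symmetry between domains can be illustrated with a small sketch. It assumes that, for each unit, both the domain recorded in the frame and the correct domain are known; the pairs below are made up, and the function name is hypothetical.

```python
from collections import Counter


def domain_coverage(units):
    """Count domain-level under- and over-coverage from misclassified units.

    `units` is an iterable of (frame_domain, true_domain) pairs; knowing
    the true domain is an assumption made for illustration.
    """
    under, over = Counter(), Counter()
    for frame_domain, true_domain in units:
        if frame_domain != true_domain:
            over[frame_domain] += 1   # unit wrongly included in this domain
            under[true_domain] += 1   # unit missing from its correct domain
    return under, over


units = [
    ("manufacturing", "manufacturing"),
    ("manufacturing", "services"),
    ("services", "services"),
    ("services", "manufacturing"),
    ("services", "manufacturing"),
]
under, over = domain_coverage(units)
print(dict(under), dict(over))
```

Every misclassified unit is counted once as over-coverage and once as under-coverage, so the two totals agree when summed over all domains.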
There are some different possibilities here for coverage deficiencies:
• remain undetected (for example an erroneous SIC code remains)
• detected for the sample (or more accurately for the responding units; for example the number of employees in the questionnaire)
• detected on the population level (for example a general update of SIC codes between sampling and estimation)
Again, the resulting inaccuracy depends on the amount of the coverage deficiencies, the ability to detect them, and the counter-actions taken in the estimation procedure.
5.3.6 Some comments on frame errors
Even if the construction of a frame population is easy in principle, there is much work in practice with the BR and the frame with regard to births, deaths, organisational changes, contradictory pieces of information, duplicates, mistakes, identification problems, time delays, etc. Identification is important, for example to eliminate duplicates due to different sources. Archer (1995) describes the maintenance of business registers, including some examples from New Zealand. One statement made is that identifying births typically involves a quarter of the total resources needed.
A close co-operation between the BR and the statistical surveys using it as a frame is important. This includes an understanding on both sides of the different uses. It also means a lot of work on single cases to handle them correctly both over time and in different surveys, for example in cases of reclassifications and reorganisations. Particular care is needed with large enterprise groups which have complex structures and span several different activities. Such entities may cut across different surveys, and the structures are subject to change. It is important that they are monitored closely so that changes can be picked up quickly and handled consistently. In the UK there is a Complex Business Unit to this end. A number of other countries have a similar organisation, some of them also being responsible for all survey data collection.
In the discussion of quality assurance for business surveys by Griffiths & Linacre (1995), frame creation, maintenance, and monitoring is an important part, including illustrations of births, deaths, and time lags.
The term frame error is not always a correct description – coverage deficiency is often more adequate, showing the consequence and not just blaming the frame, for example for not having included mergers in January 1998 in a frame constructed at the end of 1997.
5.3.7 Defining a Business Register covering a time period
The target population has reference times for the units that equal those of the variables, as mentioned above. This means, for enterprises and annual statistics for example, that the enterprises included should not be those that are active at the time of the frame construction but all enterprises that are active during the year, whether active the whole year or during a part of the year only.
If the frame is constructed at the end of the year (see discussions in Sections 5.3.2 and 5.6.1-5.6.2), the enterprises missing in the frame are “early deaths and late births”, that is broadly those that are (i) no longer active according to the BR but have been active previously in the year, and (ii) not active in the BR but active later in the year. Moreover, with SIC codes referring to a different period than the target calendar year, there will be misclassifications. This shows the frame deficiencies affecting statistics unless actions are taken. A special BR with the purpose of such actions is introduced below.
At some point after the calendar year it is possible – at least in principle and if the information needed has been kept – to combine information from the BR including time stamps, and possibly also from other sources, to derive a new Business Register that refers to the calendar year. In the case of enterprises, it includes all enterprises that have been active at some time during the calendar year. The values of the variables also refer to the full year. If the basic values have reference times that are points in time, some procedure is needed, perhaps a suitably chosen average of values before/during/after the year. The same is possible for a different period, like a quarter, but due to the time delay, such a register is less likely to be useful.
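The selection of units for a register covering a calendar year can be sketched as an interval-overlap test on activity time stamps. This is a minimal illustration, assuming each unit carries a start date and an optional end date of activity and that these time stamps are correct; the record layout, function names, dates, and frame construction date are all hypothetical.

```python
from datetime import date


def active_during(start, end, period_start, period_end):
    """True if the activity interval overlaps the period (end=None: still active)."""
    return start <= period_end and (end is None or end >= period_start)


def classify_against_frame(start, end, frame_date):
    """Label a unit relative to a frame snapshot taken on frame_date."""
    if end is not None and end < frame_date:
        return "early death"   # ceased activity before the frame was constructed
    if start > frame_date:
        return "late birth"    # started activity after the frame was constructed
    return "in frame"


# Made-up register records: unit id -> (activity start, activity end).
records = {
    "E1": (date(1990, 1, 1), None),
    "E2": (date(1995, 3, 1), date(1997, 6, 30)),
    "E3": (date(1997, 12, 1), None),
    "E4": (date(1990, 1, 1), date(1996, 5, 31)),
}
year_start, year_end = date(1997, 1, 1), date(1997, 12, 31)
frame_date = date(1997, 11, 15)  # assumed frame construction date

# All units active at some time during the calendar year.
year_register = {
    uid for uid, (start, end) in records.items()
    if active_during(start, end, year_start, year_end)
}
labels = {
    uid: classify_against_frame(*records[uid], frame_date)
    for uid in sorted(year_register)
}
print(labels)
```

Unit E4 ceased before the year began and is excluded, while E2 and E3 belong to the year-covering register although a frame taken on the assumed date would miss them. The sketch does not cover the averaging of variable values over the year, nor delays in the recording of time stamps.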
Sweden has some experience of a BR covering a calendar year and its use, illustrated in Sections 5.7.2-5.7.3. It is then regarded as the best knowledge attained. Statistics based on this BR and another, previous version are compared. This is one way to evaluate effects of frame errors. Furthermore, the improvement of the accuracy through using this BR should be considered together with the efforts involved, to see if the effort is cost-effective.
An “ordinary” BR shows the situation at some point in time, like a snapshot. However, considering that the rate of updating varies between variables and units, it is rather a mixture of snapshots of the units with regard to delineation, SIC code, size measures etc.