Section E: Data Processing and Analysis - RSS Ordinary Certificate in Statistics

The questionnaires or forms that we design are used to collect accurate data. Survey organisations such as ONS will perform several steps to transform the data collected into information that can be assimilated and used. These steps are known as data processing.

Note: The following is a guide to what typically happens in ONS in terms of data processing and analysis. The process used by other organisations may vary.

Coding Structures

Before the data we collect is entered onto a computer for data processing it must be coded. Coding involves allocating a number to each of the possible responses provided to a closed question, or allocating a code to the response of an open question. A code is quicker to enter onto computer systems than text responses, thus data processing is more efficient. Coding also aids the data analysis stage, as it categorises the responses given and enables the frequency of selection to be calculated.

Note that codes should cover all possible responses and should not overlap. Coding can be carried out at one of three stages:

(a) Before the survey

Where closed questions are being used, a code can be assigned to each of the possible responses on the questionnaire before the survey is sent out. This is known as pre-coding. The responses are then ready for data entry as soon as the questionnaire is returned to the office. However, the codes may distract the respondent from the question, so where you position them on the questionnaire is important (a common place is on the right-hand side of the response box, in a small or greyed-out font). For example:

How well would you say your dietary needs (e.g. low fat, vegetarian, vegan, kosher, etc.) are catered for by the ONS canteen?

Please tick one box only.

Very well 1

Quite well 2

Not very well 3

Not at all 4
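As an illustration only, the pre-coded question above could be represented in Python as follows. The code frame is taken from the example; the list of returned answers is hypothetical and is there only to show how coded responses lead directly to a frequency count.

```python
from collections import Counter

# Code frame taken from the example question above.
CODE_FRAME = {
    "Very well": 1,
    "Quite well": 2,
    "Not very well": 3,
    "Not at all": 4,
}


def code_response(answer):
    """Return the numeric code for a ticked box.

    A KeyError is raised if the answer is not in the code frame,
    which signals that the frame does not cover every response.
    """
    return CODE_FRAME[answer]


# Hypothetical returned answers, for illustration only.
returned = ["Quite well", "Very well", "Quite well", "Not at all"]

coded = [code_response(answer) for answer in returned]
frequencies = Counter(coded)

print(coded)
print(frequencies)
```

Because each response maps to exactly one code, the codes do not overlap, and counting how often each code occurs gives the frequency of selection.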

(b) During the interview

Where open questions are being used during an interviewer-led questionnaire, the interviewer can code the responses as they are given. Note, however, that the coding schedule must have been established before the interview takes place. The advantage is that the interviewer is able to clarify the response and provide the appropriate code. The disadvantage is that the interviewer may bias the respondent, or may interpret detailed or complicated responses in line with personal prejudices and so record the wrong code.

(c) After the survey

Where open questions are being used in a self-completion survey, a range of responses are received and coded after the survey has been completed. All given answers will therefore be considered in the coding. However, responses to open questions may be incomplete or vague and thus difficult to code.

Missing Value Codes

(a) Text Responses

The codes used in practice for a missing text response (for example, where a respondent has missed a question out completely) will vary, but a symbol such as ‘.’ or ‘#’ may be used.

(b) Numeric responses

For numerical responses it is important to distinguish between a missing value and a returned zero. Hence it is also good practice to provide a code for a missing numerical value, such as ‘99’ or ‘999’, chosen so that it cannot occur as a genuine response. This will aid analysis later on and will prevent confusion with returned responses.
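The short Python sketch below shows why the missing-value code has to be handled separately from genuine responses. The code 999 and the list of responses are assumptions made for illustration; they are not taken from any ONS survey.

```python
from statistics import mean

# Assumed missing-value code; it must lie outside the range of genuine answers.
MISSING = 999

# Hypothetical coded responses, e.g. number of canteen visits last week.
# The zeros are genuine returned values; the 999s are missing responses.
visits = [3, 0, 999, 5, 0, 999, 2]

# Keep the returned zeros, drop only the missing-value code.
valid = [v for v in visits if v != MISSING]
n_missing = len(visits) - len(valid)

mean_valid = mean(valid)    # based on genuine responses only
mean_naive = mean(visits)   # wrongly treats 999 as a real answer

print(n_missing, mean_valid, mean_naive)
```

The two zeros stay in the calculation because they are real answers, while the two missing values are counted but excluded from the mean.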

Coding and Data Analysis

Coded answers are easily analysed using computer software packages such as SPSS (Statistical Package for the Social Sciences) or SAS (Statistical Analysis System). Both of these packages are used extensively across the Government Statistical Service for statistical analysis.

These packages allow researchers/statisticians to produce a wide range of summary statistics, like the mean or standard deviation, and tables and graphs that can be used in reports. Note, however, that operators need to be trained to use the software packages in order that they are used correctly and that outputs are interpreted appropriately.

Data Capture

This is the process whereby the data we collect on questionnaires or forms is transferred to an electronic file and subsequently put onto the computer. Before we can complete this step, we must ensure that the questionnaire or form is ready for data capture. The questionnaire is reviewed by someone to ensure that all of the minimum required data have been reported, and that they are understandable.

There are several methods used for data capture:

• Batch Keying is one of the oldest methods of data capture. It involves the manual keying of data onto the computer. During this keying period no immediate editing takes place, so validity and range edits need to be implemented to ensure quality keying. This does not mean the data are being re-edited, but if a field is numeric and alpha characters are entered instead, the error will be flagged.

• The scanning of questionnaires to capture the information that has been supplied on them is commonly used within ONS. The main system used is Intelligent Character Recognition (ICR). Within this system questionnaires are designed so that responses can be distinguished from the actual questionnaire.

For successful data capture through both batch keying and scanning there is a need to consider how the information we collect will be interpreted when designing a questionnaire. We can do this by considering the format we want our responses in. For example, if we ask for date of birth then we should state whether we want this in DD/MM/YYYY or some other format.
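The checks described above can be sketched in Python. This is a minimal illustration, not the ONS system: the field names (employees, date_of_birth) and the range limits are hypothetical. It flags errors without correcting them, in line with the point that validity and range edits do not re-edit the data.

```python
from datetime import datetime


def check_record(record):
    """Return a list of error flags for one keyed record.

    The record is a dict of strings, as typed by the keyer.
    Field names and limits are assumptions for illustration.
    """
    errors = []

    employees = record["employees"]
    if not employees.isdecimal():
        # Validity edit: alpha characters in a numeric field.
        errors.append("employees: not numeric")
    elif not 0 <= int(employees) <= 100000:
        # Range edit: assumed plausible limits.
        errors.append("employees: out of range")

    try:
        # Format check: the date must be in DD/MM/YYYY form.
        datetime.strptime(record["date_of_birth"], "%d/%m/%Y")
    except ValueError:
        errors.append("date_of_birth: not DD/MM/YYYY")

    return errors


print(check_record({"employees": "2S", "date_of_birth": "1980-02-01"}))
```

A record that passes returns an empty list; a record with a letter keyed into the numeric field, or a date in another format, is flagged for follow-up.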

Some of the questionnaires that we pass through ICR will fail on scanning; that is, they cannot be interpreted. A common example of a scanning error is where a respondent enters a six but the scanner picks it up as a zero. Whilst ICR is adept at picking up characters, some will ultimately fail. Where this is the case, a team of data experts will manually input the information on the questionnaire onto a computer.

• Data recorded by an interviewer can sometimes be entered directly onto a computer. These files can then be transferred electronically to the relevant system.

Once an electronic file of all of the information collected has been created, the data is passed through a series of validation and automatic editing rules. One automatic editing rule used at the ONS is automatic rounding. Much of the turnover data that ONS collects is asked for in £000s, so an automatic editing rule has been set up to check that a respondent’s data has been reported in the correct format. Where it hasn’t, the rule is programmed to correct the respondent’s turnover and transform it into the appropriate form. For example, if a respondent returns a monthly turnover figure of £1,000,000 when £1,000 is far more likely (i.e. the respondent has written their figure in full, rather than in £000s), the system will automatically adjust this.

All data are also passed through a set of validation gates. These gates check the feasibility of the data and highlight possible errors. The data that fail validation are passed to a team of analysts who contact the respondent to confirm the data or query it. They then correct it where necessary.
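The text does not say how the automatic rounding rule decides that a figure has been written in full. The Python sketch below assumes one plausible approach: compare the reported figure with the respondent's previous return and treat a ratio of roughly 1,000 as a sign that the figure was not given in £000s. The function name, the comparison with a previous return and the ratio limits are all assumptions for illustration.

```python
def auto_correct_thousands(reported, previous, low=600, high=1400):
    """Return (value, corrected) for a turnover figure expected in £000s.

    Assumption: a figure about 1,000 times the previous return
    (ratio between low and high) has been written in full pounds,
    so it is divided by 1,000. Otherwise it is left unchanged.
    """
    if previous > 0 and low <= reported / previous <= high:
        return round(reported / 1000), True
    return reported, False


# Previous return was 1,050 (in £000s); this month 1,000,000 is reported.
print(auto_correct_thousands(1_000_000, 1_050))

# A figure in line with the previous return is left alone.
print(auto_correct_thousands(1_100, 1_050))
```

Returning a flag alongside the value keeps a record of which figures were changed automatically, so they can still be reviewed if they later fail validation.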

Data Analysis

Once all data has been edited in this manner it will be passed to a results team who will analyse it further. The main interest for this team is how the individual data combine to form the key outputs of the survey. Here they will use statistical computing packages such as SAS to analyse the data sets and to find out about aspects of the data such as the trend, irregular movements and outliers or freak values.

After the data has been analysed, the final results are produced, published and disseminated.

Outliers or Freak Values

An outlier is a response that is unusual in comparison to responses from other respondents. That is, the response appears to be inconsistent with the rest of the population. It may be so different from other responses that it arouses suspicion. Outliers are identified after all the data has been processed and has passed through validation and editing.

Since outliers are unusual responses there should be a small number of outliers in any dataset. There are two main types of outlier:

• Representative outliers – representative outliers are genuine values which cannot be assumed to be unique in the population.

• Non-representative outliers – non-representative outliers are unique or incorrect data values which should be looked at and treated by editing and imputation systems.

Outlier theory has been developed to deal with representative outliers only, that is, it assumes that all data is correct.

Detecting Outliers

One of the easiest ways to detect an outlier is through plotting all data on a scatter plot. Any observations that appear to fall away from the bulk of the data can be viewed as outliers. There are also two mathematical ways of detecting outliers.

• Distance from the mean

Calculating the distance from the mean is the most common method used for detecting outliers. It works by measuring the relative distance between a response and the average response. Those values that appear to have a large distance from the average value are deemed to be outliers.

(Please note that you do not need to know how to calculate distance from the mean, but you do need to recognise it as a method of detecting an outlier)
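For readers who would like to see the idea in practice, here is a minimal Python sketch of one common way to measure distance from the mean: the distance is divided by the standard deviation so that it is relative to the spread of the data. The cut-off of 2 standard deviations and the data are assumptions for illustration; the text does not specify either.

```python
from statistics import mean, stdev


def distance_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard
    deviations from the mean (threshold is an assumed cut-off)."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]


# Hypothetical responses with one unusually large value.
responses = [12, 15, 14, 13, 16, 15, 14, 13, 15, 95]
print(distance_outliers(responses))
```

Note that a very large value also inflates the mean and standard deviation themselves, which is one reason the choice of cut-off matters.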

• Trimming

Trimming is a relatively simple method for detecting outliers. The responses are sorted into ascending order and the top x% and bottom y% of the responses are identified as outliers. The upper and lower percentages are pre-determined by the researcher, but typically they are between 2% and 20%. This method tends to identify a large number of outliers, depending on the number of responses, and the values for the upper and lower percentages.
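Trimming as described above can be sketched in a few lines of Python. The percentages and the data are chosen for illustration only, and the number of responses trimmed at each end is rounded down, which is an assumption.

```python
def trim_outliers(values, lower_pct=5, upper_pct=5):
    """Sort the responses and split them into three lists:
    (bottom outliers, retained responses, top outliers)."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of responses to trim at each end, rounded down.
    n_low = int((n * lower_pct) // 100)
    n_high = int((n * upper_pct) // 100)
    low = ordered[:n_low]
    high = ordered[n - n_high:]
    kept = ordered[n_low:n - n_high]
    return low, kept, high


# Twenty hypothetical responses, trimming 10% from each end.
low, kept, high = trim_outliers(list(range(1, 21)), lower_pct=10, upper_pct=10)
print(low, high, len(kept))
```

The example shows why trimming tends to identify many outliers: a fixed share of the responses is flagged at each end whether or not those values are actually unusual.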

Dealing with outliers

Before any moves are made to deal with an outlier the first step should be to re-check the results; a recording error may be the explanation, or two or more data items may have been mixed up. Data entry to a computer may also be faulty (e.g. scanning or keying errors).

If an outlier is confirmed then action can be taken to reduce its effect on the survey results. Analysis may be carried out with and without the doubtful values to see if omitting a possible outlier makes a difference to the survey results. But it is logically dangerous to omit observations unless there is a valid, likely reason for doing so. The more extreme values give information on how variable the data is, which in some studies is just as important as location.
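A with-and-without comparison of this kind might look like the following Python sketch. The data are hypothetical and the value 95 is simply assumed to be the doubtful observation.

```python
from statistics import mean, median, stdev


def summarise(values):
    """Return measures of location and spread for a set of responses."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


# Hypothetical responses; 95 is the doubtful value.
responses = [12, 15, 14, 13, 16, 15, 14, 13, 15, 95]
without = [v for v in responses if v != 95]

with_outlier = summarise(responses)
without_outlier = summarise(without)

print(with_outlier)
print(without_outlier)
```

Here the mean and standard deviation change a great deal when the doubtful value is omitted, while the median hardly moves. Seeing both sets of results helps decide how much the outlier matters before any decision to omit it is made.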
