THE DATA FILE

First steps with SPSS 8 for Windows

Before you can analyze data, you need to create a file which holds them. To illustrate the way in which these files are produced, we will use an imaginary set of data from a questionnaire study which is referred to as the Job Survey. The data relating to this study derive from two sources: a questionnaire study of employees who answer questions about themselves and a questionnaire study of their supervisors who answer questions relating to each of the employees. The questions asked are shown in Appendix 2.1 at the end of this chapter, while the coding of the information or data collected is presented in Table 2.1. The cases consist of people, traditionally called respondents by sociologists and subjects by psychologists, although the term now preferred by psychologists is participants. Although questionnaire data have been used as an example, it should be recognized that SPSS and the data analysis procedures described in this book may be used with other forms of quantitative data, such as official statistics or observational measures.

As the data set is relatively large, it may be more convenient to let skilled personnel enter the data into a file for you if you have access to such a service. If you do, they may enter it into what is called a simple text or ASCII file. ASCII stands for American Standard Code for Information Interchange and is widely used for transferring information from one computer to another. You then read this data file into SPSS. If you do not have access to such a service or if the data set is small, then it may be easier for you to enter the data directly into an SPSS window called Data Editor. Both these procedures are described later on in this chapter.

Table 2.1 The Job-Survey data

With a simple text file, the data are put into a space which consists of a large number of rows, each of which comprises a maximum of eighty columns on many computers. Each column in a row can only take one character such as a single digit. The data for the same variable are always placed in the same column(s) in a row and a row always contains the data of the same object of analysis or case. Cases are often people, but can be any unit of interest such as families, schools, hospitals, regions or nations.

Since it is easier to analyze data consisting of numbers rather than a mixture of numbers and other characters such as alphabetic letters, all of the variables or answers in the Job Survey have been coded as numbers. So, for instance, each of the five possible answers to the first question has been given a number varying from 1 to 5. If the respondent has put a tick against White/European, then this response is coded as 1. (Although the use of these categories may be questioned, as may many of the concepts in the social sciences, this kind of information is sometimes collected in surveys and is used here as an example of a categorical (nominal) variable. We shall shorten the name of the first category to ‘white’ throughout the book to simplify matters.) It is preferable in designing questionnaires that, wherever possible, numbers should be clearly assigned to particular answers so that little else needs to be done to the data before they are typed in by someone else. Before multiple copies of the questionnaire are made, it is always worth checking with the person who types in this information that this has been adequately done.

It is also important to reserve a number for missing data, such as a failure to give a clear and unambiguous response, since we need to record this information. Numbers which represent real or non-missing data should not be used to code missing values. Thus, for example, since the answers to the first question on ethnic group in the Job Survey are coded 1 to 5, it is necessary to use some other number to identify a missing response. In this survey all missing data except those for absenteeism have been coded as zero, since this value cannot be confused with any of the values used to represent non-missing data. Because some employees have not been absent from work (i.e. zero days), missing data for absenteeism could not be coded as ‘0’. Instead, they are indicated by ‘99’ since no employee has been away that long. Sometimes it might be necessary to distinguish different kinds of missing data, such as a ‘Don’t know’ response from a ‘Does not apply’ one, in which case these two answers would be represented by different numbers.
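The coding scheme for missing data can be illustrated with a short Python sketch. This is not part of SPSS, where missing values are declared in the program itself; it simply shows the logic of reserving one code per variable. The variable names `ethnicgp` and `absence` are hypothetical labels chosen for this example.

```python
# Missing-value codes following the scheme described in the text:
# 0 marks a missing answer for every variable except absenteeism,
# which uses 99 because 0 is a real value (zero days absent).
# The variable names are hypothetical.
MISSING_CODES = {"absence": 99}
DEFAULT_MISSING = 0


def decode(variable, value):
    """Return None if value is the missing-data code for this variable,
    otherwise return the value unchanged."""
    missing = MISSING_CODES.get(variable, DEFAULT_MISSING)
    return None if value == missing else value


print(decode("ethnicgp", 0))   # a missing answer
print(decode("absence", 0))    # a real value: no days absent
print(decode("absence", 99))   # a missing answer
```

If different kinds of missing data had to be distinguished, the dictionary could hold a set of codes for each variable rather than a single number.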

It is advisable to give each participant an identifying number so that you can refer to them if necessary. This number should be placed in the first few columns of each row or line. Since there are seventy participants, only columns 1 and 2 need to be used for this purpose. If there were 100 participants, then the first three columns would be required to record this information as the largest number consists of three digits. One empty or blank space will be left between columns containing data for different variables to make the file easier to read, although it is not necessary to do this.
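The number of columns needed for the identifying number follows directly from the number of digits in the largest number. A minimal Python sketch of this arithmetic is given below; the function names are invented for the example, and padding shorter numbers with leading blanks is an assumption (leading zeros would serve equally well).

```python
def id_width(n_participants):
    """Number of columns needed to hold the largest identifying number."""
    return len(str(n_participants))


def format_id(number, n_participants):
    """Right-align the number in a field wide enough for every participant
    (assumption: shorter numbers are padded with leading blanks)."""
    return str(number).rjust(id_width(n_participants))


print(id_width(70))    # seventy participants
print(id_width(100))   # one hundred participants
print(repr(format_id(7, 70)))
```

With seventy participants the function gives a width of two columns, and with 100 participants a width of three, in line with the reasoning above.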

Since all the data for one participant can be fitted on to one line using this fixed format, only one line needs to be reserved for each participant in this instance and the data for the next participant can be put on the next line. If more than one line were required to record all the data for one participant, then you would use as many subsequent rows as were needed to do so. In this case, it may also be worth giving each of the lines of data for a particular participant an identifying number to help you read the information more readily, so that the first line would be coded 1, the second 2, and so on. Each line or row of data for a participant is known as a record in SPSS.
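To show what a file with more than one record per participant might look like, the following Python sketch groups consecutive lines into cases. The data lines are made up for the illustration, and it is assumed that every participant has the same number of records and that the record number is placed immediately after the identifying number.

```python
def group_records(lines, records_per_case):
    """Split a list of lines into consecutive groups, one group per case."""
    if len(lines) % records_per_case != 0:
        raise ValueError("number of lines is not a multiple of records_per_case")
    return [lines[i:i + records_per_case]
            for i in range(0, len(lines), records_per_case)]


# Made-up data: columns 1-2 hold the participant's number,
# column 4 holds the record number (1 or 2).
lines = [
    "01 1 3 4 2",
    "01 2 5 1 1",
    "02 1 2 2 4",
    "02 2 4 3 3",
]

for case in group_records(lines, 2):
    print(case)
```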

The first variable in our survey and our data file refers to the racial or ethnic origin of our respondents. Since this can take only one of six values (if we include the possibility that they might not have answered this question), then these data can be put into one column. If we leave a space between the participant’s two-digit identification number and the one-digit number representing their ethnic group, then the data for this latter variable will be placed in column 4. Since the second variable of gender can also be coded as a single digit number, this information has been placed in column 6. The third variable of current gross annual income, however, requires five columns of space since two participants (47 and 65) earned more than £10,000 and so this variable occupies columns 8 to 12 inclusive (please note that the comma and pound sign should not be included when entering the data).
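The column positions just described can be made concrete with a short Python sketch that cuts one line of a fixed-format file into its variables. SPSS reads such files with its own commands, which are described later in the chapter; the sketch only illustrates how the column numbers map on to the characters in a line. The field names and the two sample lines are invented for the example and are not taken from Table 2.1.

```python
# Column positions as described in the text (1-based, inclusive).
# The field names are hypothetical.
LAYOUT = {
    "id": (1, 2),
    "ethnic_group": (4, 4),
    "gender": (6, 6),
    "income": (8, 12),
}


def parse_line(line):
    """Cut one fixed-format record into integer fields using LAYOUT."""
    record = {}
    for name, (start, end) in LAYOUT.items():
        field = line[start - 1:end]   # convert 1-based columns to a slice
        record[name] = int(field)     # int() ignores leading blanks
    return record


# Made-up sample lines; the second has a four-digit income,
# right-aligned in columns 8 to 12.
print(parse_line("01 1 1 16600"))
print(parse_line("02 2 1  8500"))
```

Because each variable always occupies the same columns, a single layout is enough to read every line of the file.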

A full listing of the variables and the columns they occupy is presented in Table 2.2. The data file is named jsr.dat, which is an abbreviation of ‘job survey raw data’. Since SPSS accepts letters written in capitals or upper case (for example, JSR.DAT) and small or lower case (for example, jsr.dat), lower-case letters will be used to make typing easier for you. Restrictions and conventions on the form of names will be described later in this chapter.
