CATEGORICAL DATA

Descriptive statistics

Using categorical data, as the name suggests, involves placing things in a limited number of categories. Usually in organizational research these ‘things’ are people, and typical categories include gender and level of seniority. So if we say that we have 34 women and 10 men in an organization, we are referring to categorical data because people are being placed in either one of two categories.

In principle, many different types of categories can be used to classify people. Categories are used for the colour of eyes (green-eyed people, blue-eyed people, brown-eyed people etc.), for the region of the country where people live, or all sorts of other things. In organizational research, categories such as gender and level of seniority tend to be selected because these are thought to be the most important. Sometimes the categories exist before the research begins, and gender and level of seniority are examples of this. At other times categories are created during the course of the research. For example, a questionnaire might ask managers to indicate whether they agree or disagree that the organization should be re-structured. Those agreeing and disagreeing can be viewed as two different categories of people. Whatever categories are chosen, they are normally qualitatively different from each other: men are qualitatively different from women because we cannot reduce all the differences between them to a small number of dimensions such as how tall they are, how socially perceptive they are, or how extroverted they are.

Because, nowadays, computers are usually used to analyse data, and computer programs designed for statistical analysis are much better at dealing with numbers than with words, categorical data are usually represented numerically. So, instead of telling a computer that someone is ‘male’ or ‘female’, we may tell it that the person has a value of 0 or a value of 1 for gender, and that a 0 for gender indicates that they are male whereas a 1 indicates that they are female. You could represent the information about the five people mentioned earlier in the way shown in Table 1.1.

In Table 1.1, we have assigned the number 1 to females and 0 to males. The coding is arbitrary – the fact that 1 is greater than 0 does not mean that females are greater than males – we might just as well have coded males as 1 and females as 0. The important thing is that we have a numerical code that indicates which category people are in: in this case the female category or the male category. It is because these numbers simply indicate which category someone is in, and do not imply that one category is somehow greater than another, that we know we are dealing with categorical data.
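To see that the choice of codes really is arbitrary, it may help to try it out. The short Python sketch below codes gender as 0 and 1, then repeats the exercise with the codes reversed. The five records and the names `GENDER_CODES`, `REVERSED_CODES`, `encode` and `decode` are hypothetical and are used purely for illustration; they are not the contents of Table 1.1.

```python
from collections import Counter

# Hypothetical records for illustration only; not the data in Table 1.1.
people = ["female", "male", "female", "female", "male"]

# Arbitrary coding used in the text: 0 = male, 1 = female.
GENDER_CODES = {"male": 0, "female": 1}
# The reverse coding would serve equally well.
REVERSED_CODES = {"male": 1, "female": 0}


def encode(values, codes):
    """Replace each category label with its numerical code."""
    return [codes[value] for value in values]


def decode(coded, codes):
    """Recover the category labels from the numerical codes."""
    labels = {code: label for label, code in codes.items()}
    return [labels[code] for code in coded]


coded = encode(people, GENDER_CODES)
recoded = encode(people, REVERSED_CODES)

# Whichever coding is used, the same people end up in the same categories.
counts_original = Counter(decode(coded, GENDER_CODES))
counts_reversed = Counter(decode(recoded, REVERSED_CODES))

print(coded)  # [1, 0, 1, 1, 0]
print(counts_original == counts_reversed)  # True
```

The codes act only as labels: swapping them changes the numbers stored but not the number of people in each category.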

Another example would be the classification of employees according to the part of the country they work in: North, South, East or West. If you counted the number of employees working in these regions and found the following:

– North: 156

– South: 27

– East: 43

– West: 92

you would have categorical data. As with the gender data considered earlier you know that they are categorical because the information you have about each person does not suggest that they are greater than someone else in some way, but merely that they belong to a different category.

Let’s say you decided to code people according to the region they work in as follows:

– North: 0

– South: 1

– East: 2

– West: 3

Table 1.1 Age and gender of five people, with gender coded 0 and 1

Name Age Gender

You might then find that the first 10 people you found out about were as shown in Table 1.2. Note that here a ‘case’ is simply a numerical label given to a particular person in order to identify them. So the first person you have information about is labelled case number 1, the second is labelled case number 2, and so on. In most organizational research, data are collected from people and so each case refers to a particular person as it does in this example. However, this is not always true. For example, sometimes cases, the things that we collect data on, may be physical entities such as machines or, at other times, each case might be a different organization. In this book, for the sake of consistency, and because it is the typical situation in organizational research, cases refer to people.

With categorical data you must always remember that a number is being used to specify which category a case belongs in rather than to indicate anything about whether it is, in any sense, greater than, or less than, another case. If you look at the information in Table 1.2, you will, for example, see that the region of case 2 is 3, whereas the region of case 9 is 2. This does not mean that case 2 is in any way ‘greater than’ case 9: it merely means that, according to the codes given, case 2 comes from the West whereas case 9 comes from the East.
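The same point can be checked with a few lines of Python. The sketch below takes the ten cases listed in Table 1.2 and translates each code back into a region. The codes for North (0), East (2) and West (3) follow the text; South = 1 is an assumption, being the only code left over. The variable names are illustrative.

```python
from collections import Counter

# Region codes. North = 0, East = 2 and West = 3 follow the text;
# South = 1 is assumed because it is the only remaining code.
REGION_LABELS = {0: "North", 1: "South", 2: "East", 3: "West"}

# Case number -> region code, as listed in Table 1.2.
region_by_case = {1: 1, 2: 3, 3: 2, 4: 0, 5: 0, 6: 3, 7: 1, 8: 1, 9: 2, 10: 1}

# Translate each code back into its category label.
labelled = {case: REGION_LABELS[code] for case, code in region_by_case.items()}

# Counting the cases in each category is meaningful for categorical data...
counts = Counter(labelled.values())

# ...whereas arithmetic on the codes is not: the 'mean region' computed
# here does not correspond to any region at all.
mean_code = sum(region_by_case.values()) / len(region_by_case)

print(labelled[2], labelled[9])  # West East
print(counts)
print(mean_code)
```

Counting the cases in each region gives a sensible answer, but averaging the codes does not, because the codes only name the categories.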

Describing and summarizing categorical data

Having established what categorical data are, we can now consider how such data can best be described and summarized. A very helpful way is to represent the number of people in each category as relative proportions. So, if there are 10 men and 10 women in a survey we indicate that the proportions of men and women are equal. Proportions are usually expressed as percentages. Information about the number of cases in each category, and the relative proportion of cases in each category, can be shown in a simple table such as Table 1.3.

Knowing that about 40 per cent of the people are men and about 60 per cent are women makes it easier to get a picture of their relative proportions than just knowing that 15 are men and 22 are women. Similarly, Table 1.4 shows how the proportion of people from each of the four regions considered earlier could be presented.
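The arithmetic behind these percentages is simple: divide the number of cases in a category by the total number of cases and multiply by 100. As a minimal sketch, the Python function below (the name `percentages` is illustrative) does this for the gender counts in Table 1.3 and for the regional counts given earlier, rounding to one decimal place.

```python
def percentages(counts, ndigits=1):
    """Return each category's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {
        category: round(100 * n / total, ndigits)
        for category, n in counts.items()
    }


# Counts from Table 1.3.
gender_counts = {"Men": 15, "Women": 22}
# Regional counts given earlier in the chapter.
region_counts = {"North": 156, "South": 27, "East": 43, "West": 92}

gender_pct = percentages(gender_counts)
region_pct = percentages(region_counts)

print(gender_pct)  # {'Men': 40.5, 'Women': 59.5}
print(region_pct)  # {'North': 49.1, 'South': 8.5, 'East': 13.5, 'West': 28.9}
```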

As Table 1.4 shows, the percentages again make it easier to appreciate the proportion of people in each region than the simple numbers do. You might not notice that about half of the employees work in the North unless you expressed this information in terms of the percentage of cases in each category as well as the number of cases in each category.

Table 1.2 The regions in which people work, coded 0 to 3

Case   Region
1      1
2      3
3      2
4      0
5      0
6      3
7      1
8      1
9      2
10     1

Table 1.3 The frequency and percentage of men and women surveyed

         Number   Percentage
Men      15       40.5
Women    22       59.5

Another useful way of presenting information about proportions is with a graph, and a useful graph for categorical data is the pie chart. A pie chart is shown in Figure 1.1.

Figure 1.1 clearly shows the relative proportion of people from the four regions and, arguably, has more immediate impact on someone interested in this information than Table 1.4 in which the same information is presented in numerical form.

Another way of representing categorical data graphically is with a bar chart. An example of a bar chart is shown in Figure 1.2. It is based on the data used in the pie chart, but here the number of cases in each category is represented by the height of each bar. While it does not convey the relative proportions of the number of people in the four regions as clearly as the pie chart, it does enable us to compare the numbers of people in each region more easily. This illustrates the important point that the choice of a graph must be driven by the information that the writer is trying to communicate to the reader.
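If you want to experiment with the idea behind a bar chart without any graphics software, the following Python sketch draws a rough text version from the regional counts given earlier. It is only an illustration: the function name `text_bar_chart` and the `width` parameter are hypothetical, and each bar is drawn as a row of `#` characters whose length is proportional to the number of cases.

```python
# Regional counts given earlier in the chapter.
region_counts = {"North": 156, "South": 27, "East": 43, "West": 92}


def text_bar_chart(counts, width=40):
    """Return one line per category, with bar length proportional to count."""
    largest = max(counts.values())
    lines = []
    # List the categories from most to least frequent.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for category, n in ordered:
        bar = "#" * round(width * n / largest)
        lines.append(f"{category:<6}{bar} {n}")
    return lines


chart = text_bar_chart(region_counts)
print("\n".join(chart))
```

As with the printed bar chart, the lengths of the bars make it easy to compare the numbers in each region, although they say less about each region's share of the whole.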

So far in this chapter you have been looking at how to describe information about just one categorical variable at a time. However, we often wish to examine the relationship between two categorical variables. Imagine that you were interested not only in the number of employees working in each region of the country, but whether this was related to the gender of employees. You can set out information about both region and gender simultaneously in what is called a contingency table such as that shown in Table 1.5. In presenting information in a contingency table, it is often helpful to include percentage information too (see Table 1.6).

Table 1.4 The frequency and percentage of people working in four regions

Region   Number   Percentage
North    156      49.1
South    27       8.5
East     43       13.5
West     92       28.9

Figure 1.1 The proportion of people working in four regions

In Table 1.6 the proportion of males who are working in each region can be compared with the proportion of females working in each region. Presenting information in this way can draw attention to potentially interesting relationships. For example, here, the fact that the proportion of males working in the East is about six times as high as the proportion of females working in the East may be important.
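The percentages in a table of this kind are calculated within each gender: each count is divided by the total number of males (or females) rather than by the total number of employees. The Python sketch below shows the calculation using the counts in Table 1.5. The function name `column_percentages` and the variable names are illustrative.

```python
# Counts from Table 1.5: region -> (males, females).
counts = {
    "North": (26, 130),
    "South": (14, 13),
    "East": (31, 12),
    "West": (22, 70),
}


def column_percentages(table):
    """Express each count as a percentage of its gender (column) total."""
    male_total = sum(males for males, _ in table.values())
    female_total = sum(females for _, females in table.values())
    return {
        region: (100 * males / male_total, 100 * females / female_total)
        for region, (males, females) in table.items()
    }


pct = column_percentages(counts)
for region, (male_pct, female_pct) in pct.items():
    print(f"{region:<6}{male_pct:6.1f}{female_pct:8.1f}")

# How many times higher is the male percentage than the female
# percentage in the East?
east_ratio = pct["East"][0] / pct["East"][1]
print(east_ratio)
```

Calculated from the unrounded percentages, the ratio for the East is 6.25, which is the ‘about six times’ referred to above.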

Figure 1.2 The number of people working in four regions

[Bar chart. Vertical axis: Frequency, from 0 to 180. Horizontal axis: Region, with bars for North, West, East and South.]

Table 1.5 The regions in which people work, and their gender

         Males   Females
North    26      130
South    14      13
East     31      12
West     22      70

Table 1.6 Breakdown of the region in which people work, and their gender: frequencies and percentages

         Male                   Female
         Number   Percentage    Number   Percentage
North    26       28.0          130      57.8
South    14       15.1          13       5.8
East     31       33.3          12       5.3
West     22       23.7          70       31.1

Source: Chris Dewberry, Statistical Methods for Organizational Research: Theory and Practice (2004), pp. 29-34.