Summarizing and Displaying Categorical Data

(1)

Categorical data can be summarized in a frequency distribution, which counts the number of cases (the frequency) that fall into each category, or in a relative frequency distribution, which measures the proportion (or percentage) of the data set within each category.
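As a rough illustration of the two summaries just described, the sketch below tallies a small made-up data set in Python. The data values and the function names are assumptions chosen for the example, not part of the course material.

```python
from collections import Counter

# Hypothetical sample: eye colour recorded for ten people.
data = ["brown", "blue", "brown", "green", "brown",
        "blue", "brown", "hazel", "blue", "brown"]

def frequency_distribution(values):
    """Count how many cases fall into each category."""
    return dict(Counter(values))

def relative_frequency_distribution(values):
    """Proportion of the data set that falls into each category."""
    n = len(values)
    return {category: count / n for category, count in Counter(values).items()}

freq = frequency_distribution(data)
rel_freq = relative_frequency_distribution(data)

# Print one row per category: name, frequency, relative frequency.
for category in freq:
    print(f"{category:<6} {freq[category]:>3} {rel_freq[category]:.0%}")
```

The relative frequencies always sum to 1 (100%), which is a handy check on the tally.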

Categorical data can be visualized in a bar graph.

Bars, labelled by category, have heights determined by the frequency (or relative frequency) of data in that category. Bars should be separated by gaps across the display. [Excel: Data > PivotTable; Insert > Charts > Column > Clustered Column]

A pie chart represents categories with labeled sectors in a circle; the proportion of data in a category equals the proportion of the circle's area assigned to its sector. It's best not to use a pie chart when the number of categories is large. [Excel: Data > PivotTable; Insert > Charts > Pie > Pie]

(2)

Summarizing Categorical Data

The most common categorical data involves a variable with only two possible values: either the individual being measured possesses some characteristic of interest, or it doesn’t. The resulting categories can be referred to as either Success or Failure.

• proportion of successes (p)

statistic that summarizes the data set by recording the proportion of data values which are Successes:

p = (number of Successes) / n,

where n represents the number of values in the data set

• proportion of failures (q)

statistic that summarizes the data set by recording the proportion of data values which are Failures:

q = (number of Failures) / n;

since there are only two categories, we always have that q = 1 − p.
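The two proportions can be computed directly from a list of outcomes. The following is a minimal sketch, assuming each observation is recorded as True (Success) or False (Failure); the sample values and function names are hypothetical.

```python
# Hypothetical data: True means the individual has the characteristic (a Success).
outcomes = [True, False, True, True, False, False, True, False]

def proportion_of_successes(values):
    """p = (number of Successes) / n."""
    n = len(values)
    return sum(1 for v in values if v) / n

def proportion_of_failures(values):
    """q = (number of Failures) / n."""
    n = len(values)
    return sum(1 for v in values if not v) / n

p = proportion_of_successes(outcomes)
q = proportion_of_failures(outcomes)

print(f"p = {p}, q = {q}, 1 - p = {1 - p}")
```

Computing q by counting Failures, rather than as 1 − p, makes it possible to confirm the identity q = 1 − p on the data.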

(3)

Displaying Quantitative Data

Numerical data can be visualized with a histogram.

Data are separated into (usually equal) intervals along a numerical scale, called classes, then the frequency distribution of data in each class is tallied. Bars are built over each interval with heights, measured along a vertical scale, given by the frequency (or relative frequency) of data within each class.

[Excel: Data > PivotTable; PivotTable Tools > Options > Group > Group Field; Insert > Charts > Column > Clustered Column; Format Data Series > Series Option > Gap Width > No Gap]
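The tallying step behind a histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are made up, the function name is hypothetical, and each class is taken to include its lower limit but not its upper limit (other conventions are possible).

```python
def class_frequencies(values, start, width, num_classes):
    """Tally values into equal-width classes [lower, upper).

    Values outside the range covered by the classes are ignored.
    """
    counts = [0] * num_classes
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(num_classes)]

# Hypothetical exam scores.
scores = [52, 67, 71, 58, 88, 93, 75, 64, 79, 81, 70, 66]
classes = class_frequencies(scores, start=50, width=10, num_classes=5)

# Text version of the histogram: one bar of '#' marks per class.
for lower, upper, count in classes:
    print(f"[{lower}, {upper}) {'#' * count}")
```

Dividing each count by the number of observations would give the relative frequency version of the same display.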

A polygon display is obtained by replacing the bars of a histogram with a broken line joining points plotted at the midpoints of the tops of the bars for each class interval.

[Excel: build histogram, then... Change Series Chart Type > Line > Line with Markers]

(4)

A cumulative frequency distribution records the number of observations that fall at or below the upper limits of each class; a cumulative relative frequency distribution records the proportion of observations that fall at or below the upper limits of the classes. The histogram-like display of the cumulative (relative) frequency distribution formed by erecting bars over each class is called an ogive.

[Excel: build polygon, then... PivotTable Field List > Values > Value Field Settings > Show Values As > % Running Total In]
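Cumulative frequencies are running totals of the class frequencies. Here is a small sketch, assuming a hypothetical set of five classes whose frequencies have already been tallied; the numbers are invented for illustration.

```python
from itertools import accumulate

# Hypothetical class frequencies, listed with the upper limit of each class.
upper_limits = [60, 70, 80, 90, 100]
frequencies = [2, 3, 4, 2, 1]

# Running total: number of observations at or below each upper limit.
cumulative = list(accumulate(frequencies))

# Divide by the total number of observations to get proportions.
n = cumulative[-1]
cumulative_relative = [c / n for c in cumulative]

for limit, c, r in zip(upper_limits, cumulative, cumulative_relative):
    print(f"<= {limit:>3}: {c:>2} ({r:.0%})")
```

The last cumulative frequency always equals the number of observations, so the last cumulative relative frequency is 1.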

A quick way to display numerical data by hand is with a stem-and-leaf display. All but the rightmost digit (or digits) of the measurement become stems; stems head rows in which the remaining digit(s), the leaves, are listed, lined up vertically in columns. (List all intermediate stems, even if they contain no leaves!)
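Although the display is meant to be drawn by hand, the construction can be sketched in Python. The version below assumes non-empty data made of non-negative whole numbers and uses only the last digit as the leaf; the data and the function name are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return (stem, leaves) rows, using the last digit of each value as its leaf.

    Every stem between the smallest and largest is listed, even if it has no leaves.
    """
    leaves_by_stem = defaultdict(list)
    for x in sorted(values):
        leaves_by_stem[x // 10].append(x % 10)
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return [(stem, leaves_by_stem.get(stem, [])) for stem in stems]

# Hypothetical measurements; note that no value falls in the 80s.
data = [52, 67, 71, 58, 93, 75, 64, 79, 70, 66]
rows = stem_and_leaf(data)

for stem, leaves in rows:
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

The stem 8 appears with an empty row, which keeps the gap in the data visible in the display.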

(5)

Describing Quantitative Data: Features of Interest

• The shape of a histogram or stem-and-leaf describes the distribution of the data – where data is concentrated and how it spreads out across the entire range of values.

• Where is the center of the distribution located?

• How much spread is there in the distribution? How tightly is the data clustered about the center?

• Is there more than one cluster, or mode? Is the data unimodal, bimodal, multimodal? Note: The location of modes can change with the scaling unit of a display (width of a bar).

• Is the distribution uniform (has a flat contour), indicating that every value is (roughly) equally represented?

Is it roughly symmetric, with equally frequent values on either side of the center (the distribution to the right of the center is the mirror image of what appears to the left)? Or is it skewed (heavier on one side of the center than the other) to the left or right, in the direction of the tail (region of most extreme values)?

(6)

Displaying Paired Numerical Data

Paired numerical data sets are quite common in statistical practice. This occurs when two Whats are measured for the same set of Whos. Often the goal is to determine whether values of one of the variables are affected by changes in values of the other variable.

• response (dependent) variable

measures a characteristic of interest in a study; the aim is to determine how this variable is affected by variation in some other quantity, namely, an ...

• explanatory (independent or predictor) variable

a variable which may turn out to influence the outcome of the response variable

• scatterplot

display of paired data as points (x, y) in a coordinate plane; here, x represents the explanatory variable, y the response variable

[Excel: Insert > Charts > Scatter > Scatter with only Markers]

(7)

To investigate the possible relationship between the variables, look for overall patterns in the plot and watch for outliers (points located far from the region where most data are clustered) or other deviations from the overall patterns.

• association

tendency for change in one variable to be accompanied by change in the other

• direction

variables display a positive association if larger values of one tend to be paired with larger values of the other, and a negative association if larger values of one tend to be paired with smaller values of the other

• form

shape of the plot, including clusters of data points; linear relationships are most important

• strength

how closely the points conform to the overall shape of the plot