Check the data for errors and missing attributes. No matter how carefully evaluators have collected, recorded, and transformed the data to an analysis medium, there will be errors. They can detect and remove some by simple checks. Computer programs can be written, or may already exist, for checking the plausibility of attributes. For example, the gender variable has two attributes, male and female, and therefore two possible numerical values, say 0 and 1. Any other value is an error and can be readily detected. In a similar way, evaluators can check all variables to ensure that the attributes in the data base are reasonable.
They can detect other errors by contingency checks. Such checks are based on the fact that the attributes for some variables are conditional upon the attributes
1Flexibility usually exists on the fuzzy border between ordinal and
interval variables. Analysts often treat an ordinal variable as if it were measured at the interval level. In fact, some authorities (see Kerlinger, 1986, pp. 401-3, for example) believe that most psychological and educational variables approximate interval equality fairly well. In any case, instrument construction should take account of the measurement level desired.
of other variables. For example, if a medical case has “male” as an attribute for gender, then it should not have “pregnant” as an attribute of physical condition. These kinds of if-then checks on the attributes are also relatively easy to automate.
A missing attribute, where none of the acceptable attributes for the variable is present, is more difficult to deal with. Evaluators have four options: (1) go back to the data source and try to recover the missing attribute, (2) drop the case from all analyses, (3) drop the case from any analysis involving the variable in question but use the case for all other analyses, and (4) fill in a substitute value for the missing attribute. Considerations involved in dealing with missing attributes are treated by many writers. (See, for example, Groves, 1989; Little and Rubin, 1987; and Rubin, 1987.)
When evaluators have used probability sampling with the aim of estimating population parameters from sample results, overall nonresponse by units from the sample is an especially important problem. If the nonresponse rate is substantial and if it cannot be shown that the respondents and nonrespondents are probably similar on variables of interest, doubt is cast on the estimates of population parameters.
Consequently, an analysis of nonrespondents will be needed. See Groves (1989) for an introduction to the literature on nonresponse issues.
Explore the data. A number of statistical methods have been specifically developed to help evaluators get a feel for the data and to produce statistics that are relatively insensitive to idiosyncracies in the data. Some of these, like the stem-and-leaf plot and the box-and-whiskers plot, are graphic and especially useful in understanding the nature of the data.
Chapter 7 Avoiding Pitfalls
in Tukey, 1977; Hoaglin, Mosteller, and Tukey, 1983, 1985; Velleman and Hoaglin, 1981; and Hartwig and Dearing, 1979.)
Fit the analysis methods to the study question and the data in hand. The appropriateness of an analysis method depends upon a number of factors such as the way in which the data sources were selected, the measurement level of the variables involved, the distribution of the variables, the time order of the variables, and whether the intent is to generalize from the cases for which data are available to a larger population. Some factors, like the measurement level, must be considered in every data analysis, while others, like time order, may be germane only for certain types of questions—in this instance, a question about a causal association.
When evaluators consider two or more different analysis methods, the choice may not be obvious. For example, with interval level measurement, the median may be preferable to the mean as a measure of central tendency if the distribution is very asymmetrical. But asymmetry is a matter of degree and a little error from asymmetry may be acceptable if there are strong advantages to using the mean. Or it may be easy to transform the variable so that near symmetry is attained. Statistical tests that indicate the degree of asymmetry are available, but ultimately the evaluators have to make a judgment.
“The data don’t remember where they came from.” These words of a prominent statistician underscore the point that the data analyst must be mindful of the process that generated the data. We can blindly apply a host of numerical procedures to a data set but many of them would probably not be appropriate in view of the process that produced the data. For example, the methods of statistical inference apply only to data
generated by a random process or one that is “random in effect.” (For a discussion of the circumstances in which statistical inference is appropriate, see Mohr, 1990, pp. 67-74.) Since the data do not remember how they were produced, the analyst has to ensure that the techniques are not misapplied.
Monitor the intermediate results and make analytic adjustments as necessary. Even with good planning, it is not possible to foresee every eventuality. The data in hand may be different from what was planned, or preliminary analyses may suggest new questions to explore. For example, the distribution of the data may take a form not anticipated so that analytic
transformations are necessary. Or, a program may have an unanticipated effect that warrants a search for an explanation. The analyst must scrutinize the intermediate results carefully to spot opportunities for supplementary analyses as well as to avoid
statistical procedures that are not compatible with the data.