Methods Training
7.4 Which Methods To Study?
7.4.4 The Data Generating Process
Good empirical research ties theory and data analysis closely together. This is true whether the data and analysis methods are qualitative, quantitative, experimental, etc. A theory provides an explanation of why some process works the way it does. Why do voters vote the way they do? Why are some countries more likely to experience war than others? The answers to questions like these are theoretical statements. These things happen because [Fill in the Blank]. What you use to fill in the blank is a theory.
The best way to ensure a close connection between the theory and the data analysis is to think of the outcome of interest as data generated by some process. The outcome of interest might be how citizens vote, if countries go to war, if particular policies are passed, or how strongly someone feels about their group identity. These outcomes are the result of some process. It is that process that you are trying to understand and explain with your theory and analysis. In other words, you want to develop a theory and complementary statistical model of the data generating process you are studying.
You should be able to write down your theory in three ways: 1) in words, 2) as a statistical model, and 3) as a figure or illustration. You should be able to translate back and forth between these three representations. Let’s look at an example.
Suppose you have some outcome, Y, that you believe is related to some other factor, X, but that also has some random or unpredictable component. We can, and should, be more precise about what we mean by “is related to.” To keep it simple, let’s say that we believe when X takes on higher values, Y is also likely to take on higher values. In other words, we expect there to be a positive relationship.
A statistical model consistent with this description is presented in Equation 7.1:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{7.1}$$
where Y is the dependent variable and X is the independent variable. Each β is a parameter to be estimated. The first parameter, β0, is often called the constant or intercept: it is the expected value of Y when the independent variable equals 0. The second parameter, β1, captures the relationship between X and Y. Specifically, it represents the marginal effect of X on Y. This means that when X increases by one unit, Y is expected to change on average by an amount equal to β1. Finally, ε represents a residual or error term that is assumed to be random. Y, X, and the error term are also subscripted by i because every individual observation in your data set will have its own value of Y, X, and the error term. The parameters are not subscripted this way because you will be estimating a single average value for each one of them.
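One way to make the idea of a data generating process concrete is to simulate one. The sketch below draws data from Equation 7.1 and then recovers the two parameters by least squares. The parameter values, the range of X, the sample size, and the function names are all assumptions chosen for illustration; nothing here comes from a real data set.

```python
import random


def simulate_eq_7_1(n, beta0, beta1, sigma, seed=0):
    """Draw n observations from Y_i = beta0 + beta1 * X_i + e_i."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]  # assumed range for X
    # Each observation gets its own random error term e_i.
    ys = [beta0 + beta1 * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares estimates (constant, slope) for a one-predictor model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical "true" parameters, chosen only for illustration.
xs, ys = simulate_eq_7_1(n=1000, beta0=1.0, beta1=0.5, sigma=1.0)
b0_hat, b1_hat = fit_line(xs, ys)
print(f"estimated constant: {b0_hat:.2f}, estimated slope: {b1_hat:.2f}")
```

Because the simulated process really is linear with a positive β1, the estimates should land close to the values used to generate the data. With real data you never get to see the true process, which is why the theory has to tell you what model to write down.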
Equation 7.1 is somewhat consistent with our theory as expressed in words, especially if β1 turns out to be positive. However, there is already some inconsistency between our theory in words and our statistical model. In particular, our statistical model embodies a more focused definition of the relationship between X and Y. The statistical model assumes that Y is a linear function of X. This is much more specific than saying Y is related to X. It is even more specific than saying Y is positively related to X.²
The expected positive linear relationship between X and Y can be illustrated in Figure 7.1. This figure presents a hypothetical two-way scatterplot. Values of X are represented on the X axis while values of Y are represented on the Y axis. The individual dots in the plot represent data points, while the black solid line within the plot represents the linear relationship between X and Y.
The lesson here is to make sure that we can express what we expect in words, as a statistical model, and as a graph or figure, and then ensure that all three representations are compatible, if not identical. Generally speaking, mathematical formulas like statistical models require more precision than do verbal descriptions. Regardless of where you start, you should be able to produce all three representations of the data generating process implied by your theory.
Now let’s make things a bit more complicated. Suppose that you believe that Y is not only a function of X, but that gender plays a role as well. You think that males and females differ from each other with respect to Y. Luckily, you have another variable in your data set named female that is coded as 1 for people who are female, and as 0 for those who are not.
² This statistical model makes additional assumptions, as would any particular method used to estimate the statistical model, such as Ordinary Least Squares. I am ignoring all of those complications for now to simplify the presentation.
Figure 7.1: Illustrating a possible simple positive linear relationship between X and Y. [Scatterplot titled “Simple Linear Relationship”; horizontal axis: Values for X (0 to 10); vertical axis: Values for Y (0 to 6).]
To say that Y “is a function of” X and female is not a very precise statement. Nor is it precise to say that Y differs for males compared to females. How exactly is Y a function of these two predictors? If Y is a linear additive function of X and female, we could write a statistical model like Equation 7.2:
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2\,\text{female} + \varepsilon_i \tag{7.2}$$
Equation 7.2 would imply a scatterplot with two lines, one representing the relationship between X and Y for females, and one representing the relationship between X and Y for males. If males and females did not differ at all from each other regarding Y, these two lines would be plotted on top of each other and you would only see one. However, if males and females do differ from each other regarding Y, you would expect two distinct lines. Because the statistical model is linear and additive, these two lines would be parallel to each other (i.e., they would have the same slope). Whether the line for females is above or below the line for males would depend upon whether females or males have higher values of Y on average, after accounting for X.
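The parallel-lines implication of Equation 7.2 can be checked with a short simulation. This is a minimal sketch under assumed parameter values (β0 = 1.0, β1 = 0.5, β2 = 1.5); the function names are hypothetical. With a single dummy variable, the least-squares slope on X can be computed from deviations around each group's own means, which is the shortcut used here in place of a general regression routine.

```python
import random


def simulate_eq_7_2(n, beta0, beta1, beta2, sigma, seed=1):
    """Draw (x, female, y) rows from Y = beta0 + beta1*X + beta2*female + e."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)       # assumed range for X
        female = rng.randint(0, 1)   # 1 = female, 0 = not female
        y = beta0 + beta1 * x + beta2 * female + rng.gauss(0, sigma)
        rows.append((x, female, y))
    return rows


def fit_additive(rows):
    """Least-squares estimates (b0, b1, b2) for Equation 7.2."""
    sxx = sxy = 0.0
    means = {}
    for g in (0, 1):
        xs = [x for x, f, _ in rows if f == g]
        ys = [y for _, f, y in rows if f == g]
        x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
        means[g] = (x_bar, y_bar)
        # Pool within-group variation to get one common slope.
        sxx += sum((x - x_bar) ** 2 for x in xs)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    intercept_male = means[0][1] - b1 * means[0][0]
    intercept_female = means[1][1] - b1 * means[1][0]
    return intercept_male, b1, intercept_female - intercept_male


# Hypothetical "true" parameters, chosen only for illustration.
rows = simulate_eq_7_2(n=1000, beta0=1.0, beta1=0.5, beta2=1.5, sigma=1.0)
b0_hat, b1_hat, b2_hat = fit_additive(rows)
print(f"line for female = 0: Y = {b0_hat:.2f} + {b1_hat:.2f} X")
print(f"line for female = 1: Y = {b0_hat + b2_hat:.2f} + {b1_hat:.2f} X")
```

Both printed lines share the same slope by construction. The only thing the model lets differ between the groups is the vertical position of the line, and that gap is β2.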
Multiplicative Interaction Terms
However, suppose you think that the reason males and females differ with regard to Y is at least in part because the relationship between X and Y is different for females compared to males. That would imply a statistical model more like Equation 7.3, which includes a multiplicative interaction term.
$$Y_i = \beta_0 + \beta_1 X_i + \beta_2\,\text{female} + \beta_3 (X \times \text{female}) + \varepsilon_i \tag{7.3}$$
In Equation 7.3, the marginal effect of X on Y now depends upon the value of the variable female according to Equation 7.4:
$$\text{Marginal Effect of } X \text{ on } Y = \beta_1 + \beta_3(\text{female}) \tag{7.4}$$
Thus, when female equals 0, the marginal effect of X on Y simplifies down to just β1. However, when female equals 1, the marginal effect of X on Y equals β1 + β3.
Equation 7.3 would ALSO imply a scatterplot with two lines, one representing the relationship between X and Y for females, and one representing the relationship between X and Y for males. However, we would NOT expect the two lines to be parallel (i.e., they would NOT have the same slope). In fact, Equation 7.4 tells us exactly what the slopes of the two lines should be.³
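Equation 7.4 is easy to turn into a small function, and a simulation shows the two non-parallel lines it describes. This is a sketch with assumed parameter values (β1 = 0.5 and β3 = 0.7, so the implied slopes are 0.5 and 1.2) and hypothetical function names. It relies on one fact worth stating: when female is the only other predictor and it is interacted with X, fitting a separate line to each group yields the same estimates as fitting Equation 7.3 to the pooled data.

```python
import random


def marginal_effect_of_x(beta1, beta3, female):
    """Equation 7.4: the slope of Y on X for a given value of female."""
    return beta1 + beta3 * female


def fit_line(xs, ys):
    """Least-squares estimates (constant, slope) for a one-predictor model."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_interaction(rows):
    """Estimates (b0, b1, b2, b3) of Equation 7.3 from (x, female, y) rows."""
    lines = {}
    for g in (0, 1):
        xs = [x for x, f, _ in rows if f == g]
        ys = [y for _, f, y in rows if f == g]
        lines[g] = fit_line(xs, ys)
    (a_male, s_male), (a_female, s_female) = lines[0], lines[1]
    # b2 is the gap between constants; b3 is the gap between slopes.
    return a_male, s_male, a_female - a_male, s_female - s_male


# Hypothetical "true" parameters, chosen only for illustration.
rng = random.Random(2)
rows = []
for _ in range(2000):
    x = rng.uniform(0, 10)
    female = rng.randint(0, 1)
    y = 1.0 + 0.5 * x + 1.5 * female + 0.7 * x * female + rng.gauss(0, 1.0)
    rows.append((x, female, y))

b0_hat, b1_hat, b2_hat, b3_hat = fit_interaction(rows)
slope_male = marginal_effect_of_x(b1_hat, b3_hat, female=0)
slope_female = marginal_effect_of_x(b1_hat, b3_hat, female=1)
print(f"slope when female = 0: {slope_male:.2f}")
print(f"slope when female = 1: {slope_female:.2f}")
```

The per-group shortcut only works for this simple case. With additional predictors you would estimate Equation 7.3 directly and compute the marginal effects from the estimated coefficients.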
The models represented by Equations 7.1, 7.2, and 7.3 represent three different models of the data generating process, which means they also represent three different theories. Of course, there are many more models we could consider – indeed, an infinite number are available. The point remains that you need to be able to articulate a theory in words and write down a statistical model you can estimate that comports with that theory. If your theory in words implies Equation 7.3, but you estimate the model in Equation 7.2, your statistical results will be meaningless at best, but more likely misleading.
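To see why estimating Equation 7.2 is misleading when the theory implies Equation 7.3, consider the following sketch. The data are simulated from Equation 7.3 with assumed slopes of 0.5 (female = 0) and 1.2 (female = 1). The additive model can only report one slope, and that slope is a weighted average of the two group slopes, so it describes neither group. All names and values are hypothetical.

```python
import random


def group_sums(rows, g):
    """Within-group sum of squares of X and cross-product of X and Y."""
    xs = [x for x, f, _ in rows if f == g]
    ys = [y for _, f, y in rows if f == g]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxx, sxy


rng = random.Random(3)
rows = []
for _ in range(2000):
    x = rng.uniform(0, 10)
    female = rng.randint(0, 1)
    # Assumed truth follows Equation 7.3, so the slope differs by group.
    y = 1.0 + 0.5 * x + 1.5 * female + 0.7 * x * female + rng.gauss(0, 1.0)
    rows.append((x, female, y))

sxx_m, sxy_m = group_sums(rows, 0)
sxx_f, sxy_f = group_sums(rows, 1)

slope_male = sxy_m / sxx_m      # slope Equation 7.3 gives for female = 0
slope_female = sxy_f / sxx_f    # slope Equation 7.3 gives for female = 1
# The single slope Equation 7.2 would estimate from the same data.
common_slope = (sxy_m + sxy_f) / (sxx_m + sxx_f)

print(f"Equation 7.3 slopes: {slope_male:.2f} and {slope_female:.2f}")
print(f"Equation 7.2 slope:  {common_slope:.2f}")
```

Nothing in the output of the additive model signals that anything is wrong. The mismatch between theory and model is only visible if you already know which process you believe generated the data.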
The single biggest problem I see in student papers, conference papers, and even published articles is a fundamental incompatibility between the theory described in words and the statistical model specified that is supposed to be used to evaluate that theory. As the example above illustrates, simply saying that Y is a function of X and whether or not someone is female is too vague because multiple statistical models are compatible with that one statement. How can the reader know if your statistical model is appropriate if your verbal description is too vague? I just illustrated two statistical models that are compatible with that phrase, but if you consider possible nonlinear relationships and other forms of the relationship between Y and these two predictor variables, an infinite number of statistical models are possible. When seen in this light, the verbal statement that Y “is a function of” X and whether or not someone is female is almost meaningless.
Think about the social/political process you are studying as a data generating process. What is the outcome of interest that is being generated? What is the set of factors that combine to generate this outcome, and how exactly do they combine to produce it? Every statistical model comes with extremely precise answers to these questions. It is critical that the precision of the statistical model is matched by equal precision in the verbal description of the theory. I think a good way to ensure that match is to draw a picture or figure of how you think the data generating process works. This is a fundamental component of good research methods training.
³ Inclusion of the interaction term also means that the marginal effect of female on Y depends upon the value of X. Interpreting interaction terms and their statistical significance properly requires more than just looking at the output your computer generates. You need to compute meaningful marginal effects. You also need to compute proper standard errors associated with those marginal effects. Generally the best way to present results from models including interaction terms is to generate marginal effects plots. Incorrect interpretation of interaction terms is by far the single most common mistake I see in scholarly research. If this were a book about methods, I would spend more time on it here. Go read about this and make sure you understand it.
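The two quantities that footnote 3 refers to can be written out for the model in Equation 7.3. These are standard results for a linear model with a single multiplicative interaction, offered here as a sketch. The hats denote estimates, a notation the text above does not use. The first line is the expected difference in Y between female = 1 and female = 0 at a given X, and the second is the variance of the estimated marginal effect of X from Equation 7.4:

```latex
\begin{align*}
E[Y \mid \text{female}=1, X] - E[Y \mid \text{female}=0, X]
  &= \beta_2 + \beta_3 X \\
\operatorname{Var}\!\left(\hat\beta_1 + \hat\beta_3\,\text{female}\right)
  &= \operatorname{Var}(\hat\beta_1)
   + \text{female}^2 \operatorname{Var}(\hat\beta_3)
   + 2\,\text{female}\,\operatorname{Cov}(\hat\beta_1, \hat\beta_3)
\end{align*}
```

The standard error of the marginal effect is the square root of the second expression. When female equals 1, it involves the covariance between the two coefficient estimates, which is why it cannot be read off the individual standard errors in a regression table.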