
Writing Professional Papers

8.6 Basic Outline for an Empirical Paper

8.6.3 Data/Model/Methods

This section of the paper gets into the nitty-gritty of the analysis you are going to do. In the previous section, you identified the context in which you are testing your theory. In this section, you discuss specifically the data that you have from that context, the statistical models you plan to estimate, and the specific methods you will use to estimate them.

Most papers have a specific outcome of interest. We can generally think of this as the dependent variable or the outcome variable. It is the measure of the thing we are trying to explain. To continue with our example, the dependent variable of interest might record how individual survey respondents voted in a recent election. It is generally best to describe how this variable is measured first given that it is the primary outcome of interest.

Your theory generated one or more predictions about factors that would be related to the outcome of interest. In our example, the theory predicts that voters are likely to vote for candidates who share their party affiliation. This primary independent variable should be described next.

Even though your theory focuses on partisanship, you know from the literature that other factors likely influence voting behavior as well. You may want to control for these factors in your statistical model, which will require that you describe how these additional variables are also measured.

The description of each individual variable included in the analysis must be precise. All too often, authors simply say they control for things like education, income, and/or race without explicitly defining how they do so or how these variables are measured. Authors also often fail to consider what assumptions they are making when they choose how to measure variables.

For example, suppose we measured the partisan attachment of voters as -1 if they are a Democrat, 0 if they are an Independent, and +1 if they are a Republican. Such a measurement strategy assumes that all voters within any one of these categories are equivalent to all of the other voters in the same category. It also assumes that the magnitude of the difference between Democrats and Independents is the same as the difference between Republicans and Independents. Maybe a scale that distinguishes between strong and weak partisans would be better? Maybe separating the major parties into two dummy variables – one indicating whether a respondent is a Democrat or not and one indicating whether a respondent is a Republican or not – would be better? These decisions are important, and writing clearly about them is essential.
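The two measurement strategies discussed above can be sketched in a few lines of Python. This is only an illustration: the function names and category labels are hypothetical, and treating Independents as the omitted baseline for the dummy variables is an assumption, not something the text prescribes.

```python
# Illustrative coding schemes for a party-identification variable.
# The category labels and function names here are hypothetical.

def code_party_scale(label):
    """Three-point scale: Democrat = -1, Independent = 0, Republican = +1.

    Treats the Democrat-Independent gap as equal in size to the
    Independent-Republican gap.
    """
    return {"Democrat": -1, "Independent": 0, "Republican": 1}[label]


def code_party_dummies(label):
    """Two dummy variables, with Independents as the omitted baseline.

    Returns (democrat, republican). Each party would get its own
    coefficient in a regression, so the two gaps are free to differ.
    """
    return (int(label == "Democrat"), int(label == "Republican"))


respondents = ["Democrat", "Independent", "Republican", "Democrat"]
scale = [code_party_scale(r) for r in respondents]
dummies = [code_party_dummies(r) for r in respondents]
print(scale)
print(dummies)
```

Neither coding is correct in general. The point is that each one builds a different assumption into the model, and the paper should say which one was chosen and why.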

Once you have defined and explained each variable, you can present the specific statistical model you plan to estimate. Frequently, this takes the form of an equation with the dependent variable expressed as equal to one or more independent variables, with each one having some unknown parameter attached to it. We have data measuring the variables in the statistical model. What we are estimating are values for those parameters. Again, in many cases this takes the form of a regression equation like this:

$$\mathit{vote}_i = \beta_0 + \beta_1 \mathit{party}_i + \beta_2 \mathit{income}_i + \beta_3 \mathit{female}_i + \varepsilon_i \tag{8.1}$$

where vote is the dependent variable, and party, income, and female are the independent variables. Each β is a parameter to be estimated, and ε represents a residual or error term.
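To make the idea of "estimating values for those parameters" concrete, here is a minimal sketch that estimates a model of the form of equation 8.1 by ordinary least squares using only the Python standard library. Everything in it is an assumption for illustration: the eight respondents are made up, income is assumed to be measured in tens of thousands of dollars, and the helper names `solve_linear_system` and `ols` are hypothetical. With a 0/1 dependent variable this is a linear probability model. In practice you would likely use a statistics package rather than hand-rolled linear algebra.

```python
# Minimal OLS sketch for a model shaped like equation (8.1).
# All data values below are made up for illustration.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Return OLS coefficients [b0, b1, ...] from the normal equations.

    rows: list of predictor tuples (party, income, female), no constant.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical respondents: (party, income in $10,000s, female)
rows = [(-1, 3.0, 1), (-1, 5.5, 0), (0, 4.0, 1), (0, 6.0, 0),
        (1, 4.5, 1), (1, 7.0, 0), (1, 3.5, 0), (-1, 6.5, 1)]
vote = [0, 0, 0, 1, 1, 1, 1, 0]  # 1 = Republican, 0 = Democratic

b0, b1, b2, b3 = ols(rows, vote)
print(f"b1 (party) = {b1:.3f}")
```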

Once you have written the statistical model, you can translate your hypotheses into specific predictions about the parameters you are estimating. In this case, if the dependent variable is coded equal to 1 if a respondent reported voting for the Republican candidate and is equal to 0 if a respondent reported voting for the Democratic candidate, and if party is measured from -1 if they are Democrats up to +1 if they are Republicans, then our theory would predict that β1 would be positive and statistically significantly different from zero.
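Written in terms of the party coefficient, this prediction might be stated as the following pair of hypotheses. The one-sided form of the alternative and the use of a t-ratio are assumptions made here for illustration. The text only requires that the estimate be positive and distinguishable from zero.

```latex
H_0:\ \beta_1 = 0
\qquad \text{versus} \qquad
H_1:\ \beta_1 > 0,
\qquad
t = \frac{\hat{\beta}_1}{\widehat{\mathrm{se}}(\hat{\beta}_1)}
```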

Writing down the statistical model and translating your hypotheses in words to hypotheses about parameters in the statistical model allows the reader to link the statistical analysis directly to the theory motivating the paper. This translation gives the analysis meaning. It allows the author to make knowledge claims based on the data that are relevant to the theory. Without this linkage, the findings produced by the statistical analysis could be meaningful, meaningless, or misleading, and it would be hard to tell the difference.

After writing down the statistical model, you need to describe how you’re going to estimate the parameters of the model. If the method is widely known and used, you can simply report in one sentence what you are doing. You might be estimating an ordinary least squares regression, or maybe estimating a logit model via maximum likelihood estimation. If the method is not widely used, and certainly if it is new to your field of study, you need to provide more information about it. In any case, you need to provide a justification for the method you are using. Why is it appropriate for your analysis? If there are other plausible alternatives, why did you choose the one that you did? What might be the consequences of choosing a different plausible method?
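As an illustration of what reporting a logit model might look like for the voting example, the standard logit specification and the log-likelihood that maximum likelihood estimation maximizes are sketched below. Applying a logit to this example is an assumption made for illustration, and the shorthand p_i is introduced here rather than taken from the text.

```latex
\Pr(\mathit{vote}_i = 1) =
  \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \beta_1 \mathit{party}_i
        + \beta_2 \mathit{income}_i + \beta_3 \mathit{female}_i\right)\right]}

\ln L(\beta) = \sum_{i=1}^{n}
  \left[ \mathit{vote}_i \ln p_i + (1 - \mathit{vote}_i) \ln (1 - p_i) \right],
\qquad p_i = \Pr(\mathit{vote}_i = 1)
```

Because the dependent variable is binary, a logit is one plausible alternative to ordinary least squares here, which is exactly the kind of choice the paper should justify.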

When writing down a statistical model and choosing a method of estimation for it, you should consider the assumptions you are making. Assumptions associated with statistical models must be compatible with assumptions you are making in your theory. If not, your statistical model as specified and the results it produces will be biased and likely misleading.

This seems like something that should be easy, but I would say that the most common problem I see in papers is a lack of compatibility between the theory as written and the statistical model as written and/or estimated. I have seen countless examples where authors write in words that the effect of one variable on another might be different based on yet another variable. However, they simply included a control for the final variable when clearly the words imply the need for a multiplicative interaction term. I have seen work that emphasizes early on that a particular variable does not follow a normal distribution, but later uses that variable in a statistical model that assumes normality. Others often fail to notice assumptions about linear relationships between variables that are built into their statistical models or in how individual variables are measured. Finally, others misuse statistical techniques believing that they are fixing some problem that the technique does not in fact address.
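The difference between controlling for a variable and interacting with it can be seen by comparing two generic linear models. The symbols y, x, and z are placeholders introduced here for illustration and do not refer to the variables in equation 8.1.

```latex
% Additive model: z enters only as a control
y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i
\quad\Rightarrow\quad
\frac{\partial y_i}{\partial x_i} = \beta_1

% Interactive model: the effect of x depends on z
y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i
\quad\Rightarrow\quad
\frac{\partial y_i}{\partial x_i} = \beta_1 + \beta_3 z_i
```

In the additive model the effect of x is the same constant at every value of z, so it cannot represent a theory claiming that the effect of x varies with z. Only the model with the product term allows that.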

I could go on, but that implies writing a book about methods, a task I will leave to others. I will note that as a graduate methods instructor of more than 20 years, I spent a lot of time emphasizing the importance of getting the theory right and then getting the statistical model and method aligned with that theory. I urge you to pay careful attention to this aspect of any paper you write or any paper you read.

Finally, you should acknowledge any problems or limitations with the data that might complicate the analysis. Do you have any missing data? Might the data be clustered in some way? Could there be any spatial or serial correlation? The potential for problems is almost limitless, but any that might have consequences for your analysis should be addressed. Major issues should be discussed in the body of the paper. Minor issues can be addressed using footnotes. Either way, you need to note the problem(s) and report what you did in response.
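As one small example of noting a problem and reporting what you did in response, the sketch below counts missing values on each variable and then applies listwise deletion. The records, the variable names, and the use of `None` to mark a missing response are all hypothetical. Listwise deletion is shown only because it is simple to report, not because it is the recommended response.

```python
# Sketch: count missing values and apply listwise deletion.
# Variable names and values are hypothetical; None marks a missing response.

records = [
    {"vote": 1, "party": 1, "income": 4.5, "female": 1},
    {"vote": 0, "party": -1, "income": None, "female": 0},
    {"vote": None, "party": 0, "income": 6.0, "female": 1},
    {"vote": 1, "party": 1, "income": 7.0, "female": 0},
]


def missing_counts(rows):
    """Number of missing (None) values for each variable."""
    return {key: sum(r[key] is None for r in rows) for key in rows[0]}


def listwise_delete(rows):
    """Keep only rows with no missing values on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


counts = missing_counts(records)
complete = listwise_delete(records)
print(counts)
print(f"{len(complete)} of {len(records)} cases retained")
```

Reporting both numbers, how many values were missing on each variable and how many cases remained in the estimation sample, lets the reader judge how much the missing data might matter.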
