7.4 Which Methods To Study?

7.4.5 Simulations

Computer simulations are powerful tools for modern social science research. A simulation turns your computer into an experimental laboratory where you control all aspects of the environment. You can set the conditions, run the simulation, and evaluate the patterns that you see. Then, like any experiment, you can change one condition, rerun the simulation, and compare the patterns between the first and second executions of the simulation to evaluate the impact of the condition you changed.

Effective use of simulations requires that you be able to write down a data generating process with mathematical precision so you can write the proper computer code. Thus, if you skipped the previous subsection titled The Data Generating Process, go back and read it! Simulations have a number of uses that build off the ability to write down a data generating process. As mentioned above, a former graduate student of mine, Jeffrey Harden, and I wrote a book that explores a number of features of simulations in quite a bit of detail (Monte Carlo Simulation and Resampling Methods for Social Science, SAGE, 2014). Thus, I will only briefly review several uses of simulations here.

Generating Synthetic Data

Oftentimes, researchers lack the kind of data they need for some aspect of their research. In other situations, data is available but access to it is restricted. In these and other situations, computers can be used to generate synthetic data that has the desired attributes. Depending on the methods used, synthetic data can sometimes under-represent the level of complexity in actual data as well as the possibility of outliers. It will be up to you to determine whether that is the case in your situation, and if either would present a problem in your particular study.

One of the ways simulated data has been used in political science is through the simulated case-control method. The logic behind this method is to use a computer to generate a comparison case to be matched with a real case for which you have data. The idea is to generate a synthetic case that is comparable to the real case in every way except for the single treatment or attribute you are studying. The logic of this method rests on the potential outcomes framework noted above.

For example, California is one of two states that have recently adopted what is called the top-two primary system for state legislative elections. In this system, candidates from all political parties run in a single primary election. The two candidates that receive the most votes in this primary, regardless of their party affiliation, move on to face each other in the general election. Thus, you can have general election contests where both candidates are from the same political party. If a scholar wanted to measure the impact of adopting this primary system at the state level, they would need a comparison case. One option would be to select one or more states without this system that are most similar to California. The drawback of this approach is that no other state is really that similar to California. Another approach would be to compare California before this reform to California after this reform. However, this approach requires that you assume that nothing else in California changed except for the adoption of the system. A third approach would be to construct a synthetic version of contemporary California designed to look like what California would look like had it never adopted the top-two primary system.

Scholars are also developing methods to generate synthetic data when protecting the anonymity or privacy of entities in a data set is critical. Suppose you have a survey of individuals that includes particularly sensitive information about them. Even if the data does not include their names, if it includes descriptive features about their race and gender, along with location indicators like ZIP Codes, it has been demonstrated that merging this anonymous data with the data from the U.S. Census or other sources can allow for the accurate identification of most of your survey respondents.

In order to make the data available to researchers, we need a method of completely protecting the privacy of the survey respondents. This is done by generating a simulated version of the data set in question designed to contain the same properties as the original data, but consisting of observations that are by definition not real people.

Evaluating Statistical Estimators

If the assumptions behind Ordinary Least Squares regression are all met, it can be demonstrated that the parameter estimates of an OLS model are unbiased, efficient, and consistent. If we violate one of those assumptions, however, what are the consequences? In this case, we have formal mathematical proofs that tell us what to expect. However, mathematical proofs are not always intuitive. We can use computer simulations to evaluate the performance of statistical estimators instead.

We might first generate a sample of data based on a regression model where all of the assumptions are met. For example, we could generate data for a simple regression using the following model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{7.5}$$

where we set $\beta_0 = 0.2$ and $\beta_1 = 0.5$. If we drew from this process 1000 times to generate 1000 sets of data and then ran a regression on every data set, we would have 1000 estimates of each regression coefficient. Thinking about the 1000 estimates of $\beta_1$, some of them would be larger than 0.5 and some of them would be smaller than 0.5 strictly due to random chance (because of the random error term that is part of the data generating process). However, because the estimates should be unbiased, their distribution should be centered very near 0.5 and their average should be very near 0.5. If the estimates for this parameter were biased, they would not be centered around 0.5 and their mean would not be near 0.5.
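A minimal Python sketch of this experiment follows. It is an illustration under stated assumptions, not code from the book mentioned above: the sample size of 100, the standard normal distributions for X and for the error term, the seed, and the function names are all choices made here for convenience.

```python
import random
import statistics


def ols_slope_intercept(x, y):
    """Return (intercept, slope) for a simple OLS regression of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate_slopes(n_sims=1000, n_obs=100, beta0=0.2, beta1=0.5, seed=42):
    """Generate n_sims data sets from Y = beta0 + beta1*X + error and
    return the estimated slope from each one."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sims):
        # Assumption: X and the error term are both standard normal.
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
        slopes.append(ols_slope_intercept(x, y)[1])
    return slopes


slopes = simulate_slopes()
print(f"mean of the slope estimates: {statistics.fmean(slopes):.3f}")
print(f"share of estimates above 0.5: {sum(s > 0.5 for s in slopes) / len(slopes):.2f}")
```

Individual estimates scatter above and below 0.5, while their mean should land very close to the true value.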

One problem that introduces bias into the estimate of parameters from a regression model when using OLS is random measurement error in the independent variable, which in this case is X. This can be demonstrated using a mathematical proof, but you could also alter your data generating process to include a random component to the construction of X. You could use this new data generating process to generate a thousand simulated data sets, estimate a regression on each data set, and then look at the mean and the distribution of this new set of 1000 values for $\beta_1$. That distribution will not be centered on 0.5 and the mean of those 1000 estimates will not be near 0.5. How far away from the true value of 0.5 will they be? That depends on several things, including how much random measurement error there is in the independent variable. You can rerun the experiment changing the amount of measurement error and see for yourself.
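The sketch below reruns the experiment at several levels of measurement error. It rests on assumptions made only for illustration: the true X and the regression error are standard normal, the measurement error is normal with standard deviation `noise_sd`, and the outcome depends on the true X while the regression only sees the noisy version.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def mean_slope_with_noise(noise_sd, n_sims=1000, n_obs=100,
                          beta0=0.2, beta1=0.5, seed=7):
    """Average slope estimate when X is observed with random error."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        x_true = [rng.gauss(0, 1) for _ in range(n_obs)]
        # The outcome is generated from the true X ...
        y = [beta0 + beta1 * xt + rng.gauss(0, 1) for xt in x_true]
        # ... but the researcher only observes X plus measurement error.
        x_observed = [xt + rng.gauss(0, noise_sd) for xt in x_true]
        estimates.append(ols_slope(x_observed, y))
    return statistics.fmean(estimates)


for noise_sd in (0.0, 0.5, 1.0):
    print(f"noise sd = {noise_sd}: mean slope = {mean_slope_with_noise(noise_sd):.3f}")
```

With no measurement error the mean estimate should sit near 0.5, and it should shrink toward zero as the error grows. Under these assumptions the classical errors-in-variables result predicts a mean near 0.5 × 1 / (1 + noise_sd²), which gives one benchmark to compare against.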

This can be a great way to learn about statistical methods. It is also an excellent way to evaluate the performance of particular methods against the types of data you might have. Notice that we use the computer in this example to both simulate the data and to evaluate the performance of the statistical model. For more detail, including sample computer code, for this and many other similar simulations, I encourage you to look at the book Jeff Harden and I wrote that was mentioned above.

Evaluating Theoretical Propositions

I have already stressed the need for your theory written in words to be compatible with the statistical model you hope to use to evaluate the theory. Both must be compatible representations of the same data generating process. If so, you can use a computer to define everything after the equals sign of a statistical model, use that to generate a set of outcomes from that data generating process, and then compare those generated outcomes to an actual measure of the outcomes to see how well your predictions fit.

This can be taken further in a number of directions. For example, agent-based models are computer simulations used to generate predictions regarding the patterns that might emerge from the joint behavior of lots of individuals. The idea is that individual agents have a set of rules that govern their behavior. Those rules include what to do in response to the behavior of other agents around them. Thus, the behavior of agents is interdependent. The result is oftentimes surprising patterns in the behavior of the group that you might not have predicted based on the decision rules that govern each agent.

In a classic example, Thomas Schelling proposed a model in 1971 to explain racial residential segregation. He imagined a checkerboard where each square represented a residence. Then he imagined two types of people living on the checkerboard. He assumed that each person had a slight preference for their set of immediate neighbors to be a majority of their type.

He then started by randomly assigning individuals to squares. All those with a majority of their immediate neighbors being of their type would prefer not to move. All those with a majority of their immediate neighbors being not of their type would prefer to move. You allow all those who would prefer to move to relocate. Neighborhoods have changed as a result, so all individuals reevaluate their location again. Again, some would prefer to move and others not. Schelling discovered that if you continue this process until no one wants to move, you end up with a residential pattern where a vast majority of people live in neighborhoods that are overwhelmingly of their type, even if people have only a small preference for living in a neighborhood that is just a majority of their type. This simulation shows that a society can end up with a very high degree of racial segregation even if most people only have a slight preference for being in neighborhoods where just more than half of their neighbors look like them.
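A compact Python sketch of a Schelling-style model is given below. It is a simplified illustration rather than a reproduction of Schelling's original procedure. Several details are assumptions made here: a 20 by 20 board, 10 percent of squares left vacant so that movers have somewhere to go, movers relocating to a randomly chosen vacant square, and a cap of 200 rounds in case the process does not settle.

```python
import random


def same_type_share(grid, r, c):
    """Share of occupied neighboring squares that hold the same type as (r, c)."""
    size = len(grid)
    same = total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < size and 0 <= cc < size:
                if grid[rr][cc] is not None:
                    total += 1
                    same += grid[rr][cc] == grid[r][c]
    return same / total if total else 1.0  # no neighbors: nothing to object to


def average_share(grid):
    """Average same-type neighbor share across all occupied squares."""
    size = len(grid)
    shares = [same_type_share(grid, r, c)
              for r in range(size) for c in range(size) if grid[r][c] is not None]
    return sum(shares) / len(shares)


def run_schelling(size=20, empty_frac=0.1, threshold=0.5, max_rounds=200, seed=1):
    """Return the average same-type share before and after agents relocate."""
    rng = random.Random(seed)
    n_cells = size * size
    n_each = (n_cells - int(n_cells * empty_frac)) // 2
    cells = ["A"] * n_each + ["B"] * n_each + [None] * (n_cells - 2 * n_each)
    rng.shuffle(cells)  # random initial assignment to squares
    grid = [cells[i * size:(i + 1) * size] for i in range(size)]
    empties = [(r, c) for r in range(size) for c in range(size) if grid[r][c] is None]
    start = average_share(grid)
    for _ in range(max_rounds):
        # Agents want to move when fewer than `threshold` of neighbors match them.
        movers = [(r, c) for r in range(size) for c in range(size)
                  if grid[r][c] is not None and same_type_share(grid, r, c) < threshold]
        if not movers:
            break  # everyone is content, so the process has settled
        rng.shuffle(movers)
        for r, c in movers:
            k = rng.randrange(len(empties))
            new_r, new_c = empties[k]
            grid[new_r][new_c], grid[r][c] = grid[r][c], None
            empties[k] = (r, c)  # the square just vacated becomes available
    return start, average_share(grid)


start, end = run_schelling()
print(f"average same-type share of neighbors: start {start:.2f}, end {end:.2f}")
```

Changing `threshold` and rerunning is a quick way to see how the final level of clustering responds to the strength of the individual preference.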

Environmental scientists, climatologists, population ecologists, and public health scholars, among many others, have been using agent-based models and other types of simulations to study complex interdependent systems of behavior. Social scientists have not used these methods as frequently, though there are some excellent studies of crowd behavior that do so. Such simulations can be evaluated against observable data as a way of assessing their performance.

Generating Null Hypotheses

I once saw a presentation by a graduate student who had developed an agent-based model to explore recidivism among people released from jail or prison. There is ample evidence that people who have been in prison before are more likely to commit crimes than are people who have not been in prison. A common argument is that it is the prison experience that makes people more likely to commit crimes. She built a model that said the likelihood of committing crimes is a function of the local environment in which people find themselves. Her model had absolutely no recidivism mechanism built into it. Her idea was that people who leave prison are more likely to find themselves located in dysfunctional family and/or friendship networks, and this was the cause of continued criminal behavior. Her simulation produced results showing higher rates of criminal behavior among former prisoners that matched existing data, but her model produced that result from a simulation that explicitly ruled out the prison experience as the cause of criminal behavior.

Does this analysis prove that the prison experience itself does not make people more likely to commit subsequent crimes? No, but it does demonstrate that a different theory about the local context into which released prisoners move can also provide an explanation for the pattern we see in the real data. Evidence of more criminal activity among former prisoners is consistent with the idea that the prison experience is the cause, but it is also consistent with this other theory. The implication is that researchers need additional evidence to distinguish between these two theories. In particular, they need a circumstance where the two theories render different predictions.

My main point is that many research projects start with a null hypothesis that is way too simplistic. The null hypothesis should describe what you would expect to see if a particular factor had no impact on the outcome of interest. In this case, a scholar might start with the idea that if the prison experience had no impact on the likelihood of future criminal activity, then the crime rate between those with prison experience and those without it should be the same. Finding a difference constitutes evidence for rejecting that null hypothesis. Finding something other than nothing might be an important first step, but this agent-based model built around a reasonable set of assumptions resulted in a prediction of a difference in crime rates even in the absence of a prison experience effect. In my mind, this model provides a better baseline for comparing the effects of the prison system – a better null hypothesis.

In the simulation book Jeff Harden and I wrote, we explore several examples where simulations can help you generate much more plausible null hypotheses. Our objective is to encourage researchers to think about what the process under study would look like with and without a particular causal factor that is key to the theory at hand. The process in the absence of that predictor is still likely to have structure, predictability, and patterns that emerge. A null hypothesis of nothing going on is, thus, frequently naïve and simplistic.

Performing Statistical Tests

Many statistical routines are similar in spirit to computer simulations. This is particularly true in the family of methods based on resampling. Resampling methods are similar to simulations because resampling methods generate large numbers of data sets. Simulations generate those data sets out of a data generating process that you create on the computer. In contrast, resampling methods generate lots of data sets out of the actual data you have in your hands.

For example, suppose you have 10 students in a seminar. I could create another sample data set by selecting from these 10 students one student at a time, recording their information, putting them back in the pool where they might be selected again, and then selecting another one of the 10 students, etc. I do this until I have 10 observations in my new sample. Because I am sampling with replacement, Mary might show up in this new sample twice while Billy is not included at all. I could repeat this 1000 times and have 1000 different reconfigurations, or resamplings, of my original data. I could then compute any statistic of interest for each of these 1000 data sets. Then I could treat those 1000 estimates of the statistic as capturing the range of plausible values for that statistic. In fact, I could compute that statistic for my original data, and then use the 2.5th and 97.5th percentiles of the values of the statistic from my 1000 resamplings to construct a 95% confidence interval around my statistic of interest from my original sample.

This sounds like a lot of work, but in most circumstances computers can do this very quickly. In the example above, I could compute the mean grade-point average of my 10 students and then use resampling to construct a 95% confidence interval around that. Of course, there is a well-established formula for computing a confidence interval around a mean, so I could use that instead. In this case, I might prefer resampling if I am concerned that my small sample size might render that formula inappropriate.

A better example would be to compute the median grade-point average of my 10 students and then use resampling to compute 1000 more medians that I could use to construct a 95% confidence interval. Why is this a better example? Because there is no simple, widely used formula for a confidence interval around a median comparable to the one we have for a mean.
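The resampling procedure for the median can be sketched in a few lines of Python. The ten grade-point averages below are made up for illustration, and the function name, seed, and the simple way of picking the percentile cutoffs from the sorted estimates are assumptions of this sketch.

```python
import random
import statistics


def bootstrap_ci(data, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for `statistic` based on resampling
    `data` with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        # Each resample draws len(data) observations with replacement.
        statistic(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = round((1 - alpha / 2) * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Hypothetical grade-point averages for the 10 students in the seminar.
gpas = [2.7, 3.1, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0]

low, high = bootstrap_ci(gpas, statistics.median)
print(f"sample median: {statistics.median(gpas):.2f}")
print(f"95% resampling interval: [{low:.2f}, {high:.2f}]")
```

Because the statistic is passed in as a function, the same routine works for the mean (`statistics.fmean`) or for any other quantity you can compute from a sample.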

A former student of mine, Justin Kirkland, put this logic to good use when studying the degree of partisan clustering in cosponsorship networks in state legislatures.4 Networks consist of nodes and ties between those nodes. In this case, each legislator in a chamber is a node and whether they cosponsor a bill together or not constitutes a tie between them. In network science there is a statistic called modularity. It measures the density of ties between nodes within designated groups compared to the density of ties between nodes that bridge groups. Specifically, Justin wanted to look at cosponsorship ties among members of the same political party relative to cosponsorship ties that crossed party lines.

Modularity is a perfectly good measure of this phenomenon, but it has no known statistical distribution. In that sense, you cannot declare a particular value of modularity to be statistically significantly different from zero, nor could you claim two different values of modularity are statistically significantly different from each other. Simulation to the rescue!

First, Justin computed the observed level of modularity in each chamber. Then, for each chamber, Justin held the number of Democrats and Republicans constant, the total number of cosponsorship ties constant, but he used a computer to randomly assign cosponsorship. He did this 1000 times (or more) for each chamber, each time computing a measure of modularity between Democrats and Republicans. This produced a distribution of modularity scores that might occur strictly due to random chance under the hypothetical situation that legislators ignored party when deciding which bills to cosponsor. If legislators actually care about party when deciding which bills to cosponsor, the observed modularity scores for each chamber would be much higher than the scores produced through random chance. In fact, Justin used the same method described above to compute a 95% confidence interval and then evaluated whether the observed level of modularity fell within or outside of this confidence interval.
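The logic of this test can be sketched in Python as follows. This is an illustration of the general idea, not the code or data from the study itself. The chamber of 10 Democrats and 10 Republicans is invented, the "observed" ties are generated artificially so that same-party ties are more likely, and modularity is computed with the standard Newman formula for an unweighted, undirected network.

```python
import itertools
import random


def modularity(edges, party):
    """Newman modularity of an undirected network, with groups given by `party`."""
    m = len(edges)
    within = {}  # number of ties with both legislators in the same party
    degree = {}  # number of tie endpoints belonging to each party
    for a, b in edges:
        for node in (a, b):
            degree[party[node]] = degree.get(party[node], 0) + 1
        if party[a] == party[b]:
            within[party[a]] = within.get(party[a], 0) + 1
    return sum(within.get(g, 0) / m - (degree[g] / (2 * m)) ** 2 for g in degree)


def null_interval(party, n_ties, n_sims=1000, seed=3):
    """95% interval of modularity when the same number of ties is assigned
    at random, ignoring party."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(len(party)), 2))
    scores = sorted(
        modularity(rng.sample(all_pairs, n_ties), party) for _ in range(n_sims)
    )
    return scores[round(0.025 * n_sims)], scores[round(0.975 * n_sims) - 1]


# Hypothetical chamber: party membership is held fixed throughout.
party = ["D"] * 10 + ["R"] * 10

# Made-up "observed" network in which same-party ties are more likely.
rng = random.Random(11)
observed_edges = [
    (a, b) for a, b in itertools.combinations(range(len(party)), 2)
    if rng.random() < (0.6 if party[a] == party[b] else 0.1)
]

observed_q = modularity(observed_edges, party)
low, high = null_interval(party, n_ties=len(observed_edges))
print(f"observed modularity: {observed_q:.3f}")
print(f"95% interval under random assignment: [{low:.3f}, {high:.3f}]")
```

An observed value above the upper end of the interval is what you would expect to see if party shapes cosponsorship decisions.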

The general point is that resampling and other similar methods allow you to conduct statistical tests regarding any statistic you can compute, even if that statis-