Chapter 2. Comparing Two Proportions: Randomization Methods



Assumed Knowledge:

• Core logic of tests of significance (for a one proportion test) – null and alternative hypotheses – criminal justice system analogy

• P-value; p-value is “as extreme as”

• Simulation to estimate probabilities

• Null hypothesis is a statement of “no effect” or “equality”

• Alternative hypothesis is a statement of the research question

Learning Objectives:

• The statistical spreadsheet

• Difference between categorical and quantitative variables

• Be able to distinguish between response and explanatory variables

• Be able to create and interpret a 2x2 table

o Why using row percentages makes more sense than column percentages

o Why compare percentages and not numbers

o Marginal vs. conditional proportions

• Statement of null and alternative hypotheses when exploring relationships in a 2x2 table

• Reinforcement of the core logic of statistical inference

• Data manipulation and randomization tests in Fathom

• One-sided vs. two-sided tests

• Observational study vs. experiment and different conclusions that can be made from each.


Section 2.1: The statistical spreadsheet and types of variables

In this course, we will use statistics to discuss gathering, summarizing and making decisions about data. Data are simply various types of information gathered on individuals and stored in spreadsheets. Spreadsheets are used in all statistical software programs. The programs we focus on in this course are Fathom and PASW: Predictive Analytics Software (formerly known as SPSS: Statistical Package for the Social Sciences). We will begin by looking at a Fathom spreadsheet. We will also use Excel spreadsheets in this class because many programs have data export and import functions to Excel.

A spreadsheet is a grid of rows and columns. The rows represent the individuals that are in the study. These individuals are often referred to as subjects, especially if they are in an experiment. The individuals can be humans, or they can be other objects such as plants, animals, or chemicals. The columns represent the variables measured or observed on the individuals. Thus the number of rows is the number of individuals in our experiment or observational study/survey, and the number of columns is the number of variables measured or observed on the individuals.

Figure 2.1: Example of a spreadsheet

Collection 1

     idnumber   gender   age   year   class
1    1          0        21    4      4
2    2          0        18    1      1
3    3          0        20    2      2
4    4          1        20    2      2
5    5          1        21    4      4
6    6          0        20    2      2
7    7          0        20    3      3
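As a rough illustration of the layout described above (rows are individuals, columns are variables), the spreadsheet in Figure 2.1 can be sketched in Python. Python is not one of the programs used in this course, and the names `columns`, `raw_rows` and `collection_1` are assumptions made only for this example.

```python
# Each row is one individual; each column name is one variable.
# The values are copied from Figure 2.1.
columns = ["idnumber", "gender", "age", "year", "class"]
raw_rows = [
    (1, 0, 21, 4, 4),
    (2, 0, 18, 1, 1),
    (3, 0, 20, 2, 2),
    (4, 1, 20, 2, 2),
    (5, 1, 21, 4, 4),
    (6, 0, 20, 2, 2),
    (7, 0, 20, 3, 3),
]

# Pair every value with its column name so each row reads like a record.
collection_1 = [dict(zip(columns, row)) for row in raw_rows]

n_individuals = len(collection_1)  # number of rows
n_variables = len(columns)         # number of columns
print(n_individuals, "individuals,", n_variables, "variables")
```

The number of rows gives the number of individuals and the number of columns gives the number of variables, matching the description of a spreadsheet above.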

There are two main types of variables: categorical and quantitative.

Categorical variables place the individual in a category. Examples would be gender, ethnicity, eye color, or class standing.

Quantitative variables are measures on an individual for which arithmetic operations make sense. In other words, you can add, subtract, multiply and divide these variables. Examples would be height, weight, distance, or time.

In Chapter 1, we explored tests of significance for a single categorical variable with two categories. For example, we had 16 babies (rows in the spreadsheet) and for each baby we knew whether they chose the nice or naughty toy (single column in the spreadsheet). The number and types of variables that we are interested in exploring will be a key part of knowing which statistical methods to use to analyze our data.


Activity 2.1: Introduction to Fathom

In this activity we are going to get familiar with some of the basic techniques in the computer program Fathom. We will use Fathom for much of our statistical analyses over the next few chapters. Obtain the class_survey.ftm file from the textbook website. This survey was given to 265 introductory statistics students on the first day of class.

To answer the questions on the next page you will need to know the variables in the data as well as to do some simple things in Fathom. Variable definitions are below and Fathom instructions are in the gray box on the next two pages.

Variables

Gender (Female, Male)

Class standing (Freshman, Sophomore, Junior, Senior, Super Senior, Other)

Height in inches

Weight in pounds

Amount of last haircut (to nearest dollar)

Number of cups of coffee drank yesterday

Ever purchased required college texts online (Y/N)

Minutes browsing Internet yesterday

Minutes watching TV/DVDs yesterday

Minutes exercising yesterday

Ever had a car crash while driving (Y/N)

Views align with Republican Party (Strongly disagree, Disagree, Neither disagree nor agree, Agree, Strongly agree)

Views align with Democratic Party (Strongly disagree, Disagree, Neither disagree nor agree, Agree, Strongly agree)

Primary major (Natural Sciences, Social Sciences, Arts, Humanities)

1. In our survey, which variables are categorical?

2. Which variables are quantitative?

3. Create a dot plot of any quantitative variable. Then, change it to a histogram. Note: See Fathom instructions for how to create graphs.

4. Create a bar graph of a single categorical variable. Note: See Fathom instructions for how to create graphs.


Fathom Instructions

Overview

Fathom has a click-drag and drop workspace and uses some terminology from the computer science discipline. Our data is stored in a Collection. If we want to look at our data in a spreadsheet, we need to drag down a Table from the tool bar. Our individuals are numbered by row and our variables are listed by column. Fathom calls variables attributes. We’ve summarized Fathom’s terms and how they correspond to ours in the following table.

Fathom terminology    Statistics class terminology
Collection            Dataset
Table                 Spreadsheet
Attributes            Variables

Looking at the spreadsheet

Click once on the 210 class survey collection and you will get a blue box around the collection. Now drag down a Table from the toolbar.

Using Fathom to construct graphs on our sample

If you drag down a Graph you will be asked to drop an attribute into the graph. The default graph for a categorical variable is a bar graph and for a quantitative variable is a dot plot.

Using Fathom to get statistics from our sample

If you drag down a Summary, you will be asked to drop an attribute into it. The default summary for a categorical variable is a count and for a quantitative variable is a mean. The formulas for these statistics are preprogrammed. If you right click on the variable (attribute) name in your summary window, you will see many options, the first of which is to Add a Formula. There are many preprogrammed formulas you can use. If you choose Add a Formula and click on the pluses in front of Functions, Statistical, and One Attribute, you can see all the preprogrammed functions for a single variable (attribute). You can construct your own functions using functions that already exist like Sum.


Section 2.2: Exploring Relationships in 2x2 Cross-Tabulation Tables

The Norovirus Outbreak of 2008

In November 2008 an outbreak of norovirus struck students, faculty and staff at Hope College. According to the Ottawa County Health Department, “Norovirus is a highly contagious illness and a common cause of diarrhea, nausea, and vomiting. Hospitalization is rare and most people recover in 24 to 48 hours.” During the November 2008 outbreak, there were over 450 confirmed cases of norovirus among Hope students alone. Ultimately, the health department ordered the college to cancel all classes and athletic events, close all dining facilities and cancel any other event that entailed people gathering on campus. In order to learn more about how the disease spread through Hope students and its long-term impact on healthy living behaviors among Hope students, a survey of over 1800 Hope students was conducted in February 2009 (the Norovirus dataset in the databank).

Among many other questions, students were asked their gender and whether they got norovirus during the outbreak. Table 2.1 summarizes results from a cross-tabulation of gender by whether or not students got the norovirus.

Table 2.1: Number of students with norovirus by gender

          Norovirus during Nov ’08 outbreak?
Gender    Yes     No      Total
Female    212     1049    1261
Male      124     479     603

Reviewing cross-tabulation tables

Notice that Table 2.1 explores two different variables: gender and whether or not someone got norovirus. When two variables are being studied and both are categorical, the standard way of summarizing the data is using a cross-tabulation table. A cross-tabulation table presents the counts of the number of individuals in each combination of categories for the two variables of interest. For example, Table 2.1 shows us there were 212 females in our survey who said they contracted norovirus. This table is called a 2x2 cross-tabulation table because both variables have two categories and thus there are two rows and two columns in the table.
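To make the idea of counting individuals in each combination of categories concrete, here is a minimal Python sketch. It is only an illustration and not part of the Fathom workflow; the list `records` is rebuilt from the counts in Table 2.1 (we do not have the actual survey file here), and all variable names are assumptions.

```python
from collections import Counter

# Rebuild one (gender, norovirus) record per student from the counts in Table 2.1.
records = (
    [("Female", "Yes")] * 212 + [("Female", "No")] * 1049
    + [("Male", "Yes")] * 124 + [("Male", "No")] * 479
)

# Cross-tabulate: count the individuals in each combination of categories.
cell_counts = Counter(records)

# Print the table with the explanatory variable (gender) as rows.
print(f"{'Gender':<8}{'Yes':>6}{'No':>6}{'Total':>7}")
for gender in ("Female", "Male"):
    yes = cell_counts[(gender, "Yes")]
    no = cell_counts[(gender, "No")]
    print(f"{gender:<8}{yes:>6}{no:>6}{yes + no:>7}")
```

Each of the four cells of the 2x2 table is simply the number of rows in the spreadsheet that share the same pair of values.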

All scientific studies are conducted to investigate one or more research questions. Research questions can be thought of as the primary purposes or motivations for conducting a scientific study. The research question gives us information about why the researchers have conducted the study. One research question we could ask based on Table 2.1 is whether male Hope students were more likely to get norovirus than female Hope students. We might hypothesize that, since prior research shows men tend to wash their hands less often after using the bathroom (Edwards et al. 2002), male Hope students were more likely to get the very contagious norovirus than females. In order to answer this research question we will compare the men to the women in terms of their norovirus prevalence. Notice how this is different from what we did in Chapter 1. In Chapter 1, our research question was NOT comparing two separate groups; instead, we were exploring a research question about a single group (e.g. Did babies prefer the helper toy vs. the hinderer toy?). Similarly, in Chapter 1 our research question only involved a single variable (column) in the statistical spreadsheet (e.g. toy the baby preferred). Here the research question involves two variables (specifically, gender and norovirus status).

When a study’s research question involves two variables, we can often identify an explanatory variable and a response variable. An explanatory variable is the variable we think is “explaining” the change in the response variable. You may also have heard the explanatory variable called the “independent” variable or the “predictor” variable in other classes. The response variable is the variable we think is being impacted or changed by the explanatory variable. You may have heard the response variable called the “dependent” variable in other classes. In Statistics, the terms independent, dependent and predictor can have many different meanings, and so we will use the terms explanatory and response in this book.

In Table 2.1, the explanatory variable is gender and the response variable is whether or not someone got norovirus because we think that gender may be explaining (related to) norovirus status. It wouldn’t make sense to call norovirus status the explanatory variable because it doesn’t make any sense to think that somehow a student’s norovirus status is impacting/changing his/her gender! In Table 2.1 we also see that the explanatory variable is the variable that creates the rows in the table and that the response variable creates the columns. This is a standard way of creating a cross-tabulation table, and the set-up we will employ throughout the book.

In our sample of Hope College students, 212 female students contracted norovirus compared to only 124 male students who contracted norovirus. Does this mean that, in our sample, female students were more likely to contract norovirus? Not necessarily. An important thing to recognize is the sample of students in the survey has more than twice as many females as males (1261 vs. 603), so even though more females contracted norovirus, that does not mean males had a lower chance of contracting the norovirus. Instead of comparing the counts of male and female students who contracted the norovirus, we should compare the percentages of male and female students who contracted the norovirus. Note: Prevalence is the term used to describe the percentage of people with a disease.

Key Idea: Often, when you are exploring relationships between two variables, you can identify an explanatory variable and a response variable. You may have heard the explanatory variable called the “independent” or “predictor” variable and the response variable called the “dependent” variable in other classes.

Key Idea: Typically, cross-tabulation tables are created so that explanatory variables create the rows and response variables create the columns.


Table 2.2: Prevalence of students with norovirus by gender

          Norovirus prevalence
Gender    Yes                 No                   Total
Female    16.8% (212/1261)    83.2% (1049/1261)    1261
Male      20.6% (124/603)     79.4% (479/603)      603

Notice (see Table 2.2) that the prevalence of norovirus is actually higher among male Hope students in our sample (20.6%) than among females (16.8%), a difference of 3.8%, even though more females in the sample had norovirus. For this reason it is important to always compute the percentages instead of simply comparing counts. Furthermore, by setting up your table with the explanatory variable creating the rows and the response variable creating the columns (see the Key Idea above) you can always compute the row percentages to best explore the relationship between variables. Row percentages are the percentages computed so that each row’s percentages add up to 100%; more specifically, row percentages are computed by dividing each cell count by the total for that row. This will give you percentages of the response variable for each value of the explanatory variable. In this case, that means you will be able to compare the prevalence of norovirus between males and females.
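The row-percentage calculation can be sketched in a few lines of Python. This is an illustration only; the dictionary `table_2_1` and the helper `row_percentages` are names made up for this example, with the counts taken from Table 2.1.

```python
# Cell counts from Table 2.1: rows are the explanatory variable (gender),
# columns are the response variable (norovirus: Yes / No).
table_2_1 = {
    "Female": {"Yes": 212, "No": 1049},
    "Male": {"Yes": 124, "No": 479},
}

def row_percentages(table):
    """Divide each cell count by its row total and express it as a percentage."""
    result = {}
    for row_label, cells in table.items():
        row_total = sum(cells.values())
        result[row_label] = {
            col: 100 * count / row_total for col, count in cells.items()
        }
    return result

percentages = row_percentages(table_2_1)
for gender, cells in percentages.items():
    print(gender, {col: round(pct, 1) for col, pct in cells.items()})

# Male prevalence minus female prevalence, the comparison used in the text.
difference = percentages["Male"]["Yes"] - percentages["Female"]["Yes"]
print("Male minus female prevalence:", round(difference, 1))
```

Because each cell is divided by its own row total, the two percentages in every row add up to 100%, which is what makes the male and female prevalences directly comparable.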

Inference for cross-tabulation tables

Clearly, in the sample, males were more likely to contract norovirus than females (20.6% vs. 16.8%; difference of 3.8% (=20.6-16.8)). Our real interest, however, is not in the sample of Hope College students. Instead, what we’re really interested in are all Hope College students. For practical reasons we couldn’t get all students into our study, so we’re trying to use our sample to learn about the population.

With this in mind, does our sample say anything conclusive about all Hope College students (the population)? Stated another way, is this sample sufficient evidence that among all Hope College students (population) males were more likely to contract norovirus than females? Or, conversely, would it be better to attribute the difference in prevalence simply to random variation in our sample of students (our sample is different from the population)? Tests of significance can help us to answer this question.

Key Idea: When exploring relationships between variables in a two-way table, use row percentages instead of comparing cell counts.

Key Idea: Gathering information on all members of a population of interest is often impossible or impractical and so a main goal of many statistical analyses is to use a sample to learn about a population. In fact, even if gathering information on all members of a population is possible, it is almost always more costly than gathering information on only a sample. Statistics can be used to extract information about the population using only the sample.

Recognize that, because we do not have information on the entire population of Hope students, the difference we are seeing in our sample (male norovirus prevalence larger than female norovirus prevalence) could have occurred simply by random chance (we randomly happened to end up with males in the sample who were more likely to contract norovirus than females).

As you saw in Chapter 1, the first step in tests of significance is to state the null and alternative hypotheses. Remember, the alternative hypothesis corresponds to the research question we hope to answer. In this case:

Ha (Alternative hypothesis): Male Hope students had a higher prevalence of norovirus than female Hope students during the norovirus outbreak.

But, what about the null hypothesis? As we discussed in Chapter 1, the null hypothesis is a statement of equality or of no effect. In this example, a statement of equality would mean that males and females had the same prevalence of norovirus.

H0 (Null hypothesis): Male Hope students had the same prevalence of norovirus as female Hope students during the norovirus outbreak.

We have said hypothesis testing is like our criminal justice system in that we assume someone is innocent until they are, based on the evidence, proven guilty. In hypothesis testing we assume the null hypothesis is true and then look at the evidence (data) to see if we can conclude that the alternative hypothesis is true.

From Chapter 1, recall that in order to “measure the evidence” we see what outcomes of the study are likely to occur if the null hypothesis is true, and then see how likely/unlikely the actual outcome is. In our example, if the null hypothesis is true, this means that the chances of contracting norovirus are the same regardless of whether someone is male or female.

Table 2.3: Excerpt from the statistical spreadsheet for the norovirus example

norovirus

     Gender    Norovirus
1    Male      Yes
2    Female    Yes
3    Female    No
4    Male      No
5    Female    Yes

Table 2.3 shows an excerpt from the statistical spreadsheet for the norovirus example. If the null hypothesis is true (chance of norovirus is the same for males and females), then it really shouldn’t matter if I mix up (scramble) the gender column as shown in Table 2.4. That is, if the null hypothesis is true and males and females have the same norovirus prevalence, then if I scramble the gender column and look at the “new” norovirus prevalence of the males and females I shouldn’t see prevalences that are much different from what I saw in my sample.

Key Idea: The difference in male and female norovirus prevalence in our sample can be attributed to either a true difference in the population (males actually contract the norovirus more) or random variation (our sample shows different prevalence, but there is no difference in the population).

Table 2.4: Excerpt from the scrambled statistical spreadsheet for the norovirus example

Scrambled norovirus

     Gender    Norovirus
1    Female    Yes
2    Male      Yes
3    Female    No
4    Female    No
5    Male      Yes

Let’s use the scrambled spreadsheet to recreate the 2x2 cross-tabulation table.

Table 2.5: Table of norovirus prevalence after mixing up (scrambling) the gender column

                    Norovirus prevalence
Scrambled Gender    Yes                 No                   Total
Female              18.2% (230/1261)    81.8% (1031/1261)    1261
Male                17.6% (106/603)     82.4% (497/603)      603

Table 2.5 shows the result of scrambling the gender column and then re-creating the cross-tabulation table. Here, 18.2% of females had norovirus compared to 17.6% of males (a difference of 17.6-18.2 = -0.6%). So, when we scramble the gender column one time we get a difference quite a bit smaller than what we got in the actual study. In fact, the difference is actually negative because in the scrambled table females had a higher prevalence than males. To summarize, we said that if the null hypothesis is true (illustrated by scrambling the gender column), I might see a number like -0.6% for the difference in male to female norovirus prevalence.
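A single scramble of the gender column can be imitated in Python as a rough sketch. The two columns are rebuilt from the counts in Table 2.1, the seed is arbitrary, and the function name `prevalence_difference` is an assumption for this example. The scrambled difference printed will generally not equal the -0.6% in Table 2.5, since every scramble gives a different result.

```python
import random

# Rebuild the two spreadsheet columns from the counts in Table 2.1.
genders = ["Female"] * 1261 + ["Male"] * 603
norovirus = (["Yes"] * 212 + ["No"] * 1049) + (["Yes"] * 124 + ["No"] * 479)

def prevalence_difference(gender_column, norovirus_column):
    """Male prevalence minus female prevalence, as proportions."""
    cases = {"Female": 0, "Male": 0}
    totals = {"Female": 0, "Male": 0}
    for gender, status in zip(gender_column, norovirus_column):
        totals[gender] += 1
        if status == "Yes":
            cases[gender] += 1
    return cases["Male"] / totals["Male"] - cases["Female"] / totals["Female"]

rng = random.Random(2008)  # arbitrary seed, only so the run is repeatable
scrambled = genders[:]     # copy, so the original column is left untouched
rng.shuffle(scrambled)     # mix up the gender column; norovirus stays in place

observed_pct = 100 * prevalence_difference(genders, norovirus)
scrambled_pct = 100 * prevalence_difference(scrambled, norovirus)
print("Observed difference: ", round(observed_pct, 1))
print("Scrambled difference:", round(scrambled_pct, 1))
```

Scrambling keeps the same 1261 females and 603 males and the same 336 norovirus cases; it only breaks any link between the two columns, which is exactly what the null hypothesis describes.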

Recall from Chapter 1 that we need to “simulate” the null hypothesis occurring many, many times to get a feel for how unlikely our observed data is if we assume the null hypothesis is true. In Figure 2.1 we show a histogram of the difference in prevalence (male prevalence minus female prevalence) for 1000 scrambles of the gender variable.

Key Idea: If the null hypothesis is true (norovirus prevalence is the same for males and females in the population), then it doesn’t matter if I mix up (scramble) the gender column as shown in Table 2.4. Stated another way, the null hypothesis says there is no relationship between gender and norovirus prevalence; if they are not related, then if I mix up one of the columns, it really doesn’t matter.


Figure 2.1: Histogram of difference in male and female norovirus prevalence when scrambling gender

In the bottom right corner we have shaded the part of the histogram representing prevalence differences by scrambled gender that are as large or larger than the 3.8% we saw in our sample. In other words, we’ve shaded the part of the graph that represents the proportion of times we, randomly, saw males having a prevalence at least 3.8% larger than females when we assumed the null hypothesis to be true. This occurred 23 times out of 1000 scramblings. Thus, the p-value is 23/1000=0.023.

As we’ve talked about previously, a small p-value is evidence against the null hypothesis. Since 0.023 is fairly small, we have fairly strong evidence that a difference of 3.8% in the sample means that male Hope students were more likely to contract norovirus than female Hope students in the population of all Hope students. That is, we have fairly strong evidence that the null hypothesis is wrong and, thus, that the alternative hypothesis is correct.

This type of analysis is essentially the same process we looked at in Chapter 1. It can be summarized in a manner similar to Chapter 1:

1. Compute the measure in the sample. In this case, the measure is the difference in the proportions of males and females (3.8%).

2. Scramble the explanatory variable and re-compute the measure of association many times in order to simulate the null hypothesis. In this case we scramble the gender variable to simulate the null hypothesis of no difference in male/female prevalence. Note the difference in this step compared to Chapter 1.

3. Reject the null hypothesis if the measure in the sample is in the tail of the distribution generated from scrambling the explanatory variable and re-computing the measure of association (step #2).
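The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Fathom procedure used in this book: the function names, the seed, and the data columns rebuilt from Table 2.1 are all made up for the example, and the estimated p-value will vary somewhat from run to run and will not exactly match the 0.023 reported above.

```python
import random

def difference_in_proportions(explanatory, response, group_a, group_b, success):
    """Proportion of `success` in group_a minus the proportion in group_b."""
    def proportion(group):
        outcomes = [r for e, r in zip(explanatory, response) if e == group]
        return sum(r == success for r in outcomes) / len(outcomes)
    return proportion(group_a) - proportion(group_b)

def randomization_test(explanatory, response, group_a, group_b, success,
                       n_scrambles=1000, seed=1):
    """One-sided randomization test: returns (observed measure, estimated p-value)."""
    rng = random.Random(seed)
    # Step 1: compute the measure in the sample.
    observed = difference_in_proportions(
        explanatory, response, group_a, group_b, success)
    # Step 2: scramble the explanatory variable and re-compute the measure many times.
    scrambled = list(explanatory)
    as_extreme = 0
    for _ in range(n_scrambles):
        rng.shuffle(scrambled)
        simulated = difference_in_proportions(
            scrambled, response, group_a, group_b, success)
        if simulated >= observed:
            as_extreme += 1
    # Step 3: the share of scrambles at least as large as the observed measure
    # estimates the p-value; a small value means the sample sits in the tail.
    return observed, as_extreme / n_scrambles

# Columns rebuilt from the counts in Table 2.1.
gender = ["Female"] * 1261 + ["Male"] * 603
norovirus = ["Yes"] * 212 + ["No"] * 1049 + ["Yes"] * 124 + ["No"] * 479

observed, p_value = randomization_test(gender, norovirus, "Male", "Female", "Yes")
print(f"observed difference = {observed:.3f}, estimated p-value = {p_value:.3f}")
```

Only the explanatory column is shuffled; the response column stays fixed, so every scramble keeps the same row and column totals as the original table.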


Activity 2.2A: Swimming with dolphins to treat depression

Adapted from Concepts of Statistical Inference: A Randomization Based Curriculum by Rossman, Chance, Cobb, Holcomb, NSF/DUE/CCLI # 0633349.

Depression is one of the most common and debilitating diseases in the United States and around the world. Recently, researchers recruited 30 adults with a clinical diagnosis of mild to moderate depression to participate in an interesting study. In the study, all 30 adults stopped the use of any drugs or therapies they were taking for their depression and were flown to an island off the coast of Honduras. All of the depressed adults in the study participated in the same types and amounts of swimming, snorkeling and other tropical recreation, but 15 of the individuals in the study were randomly selected to also swim with dolphins every day for two weeks. At the end of the study, individuals in the study were evaluated to see if their depression symptoms had improved (Antonioli and Reveley, 2005). Table 2.6 shows a summary of the results of the study.

Table 2.6: Reduction in depression symptoms

                             Reduction in depression symptoms?
Swam with dolphins daily?    Yes    No    Total
Yes                          10     5     15
No                           3      12    15
Total                        13     17    30

1. Exploring the dolphin study

a) What is the research question for the dolphin study?

b) Based on Table 2.6, what variables were measured in the dolphin study? Are these variables quantitative or categorical?

c) Later you will look at the statistical spreadsheet for the dolphin experiment. How many rows and how many columns will there be in the spreadsheet? Why?

d) Identify which of the variables in the dolphin study is the explanatory variable and which is the response variable.

e) We discussed how you should compute percentages in order to compare the groups created by the explanatory variable, instead of simply the cell counts in cross-tabulation tables. Compute and compare the appropriate percentages for Table 2.6. Also find the difference in the two percentages.


2. Testing the research question

Next we will use a test of significance to answer the research question.

a) Based on your research question, state the null and alternative hypotheses for the associated test of significance.

b) Do you think it is possible that, if the null hypothesis is true (no effect of dolphin therapy on depression), you would see percentages like you reported in 1(e)? Do you think it is unlikely?

Yes, it is possible. Consider the following scenario: Assume that the 13 people in the study whose depression symptoms improved would have improved whether they swam with dolphins or not; let’s call them the “improvers”. Remember, everyone is getting flown to a nice, sunny, tropical location, so it makes sense that some people might be less depressed simply from the change in location. What we’re really saying here is to assume that swimming with dolphins really doesn’t do anything to help people get over their depression (the null hypothesis is true). Now, what if, by chance, of the 13 improvers (people who improve no matter what), 10 randomly ended up in the “swim with dolphins” group? Recall the important fact that subjects were randomly assigned to swim with dolphins or not. So, randomly, it is possible that 10 improvers end up in the swim with dolphins group and only 3 in the not swim with dolphins group. Thus, it is possible we would see 67% (10/15) of dolphin swimmers improving and only 20% (3/15) of non-dolphin swimmers improving even if swimming with dolphins doesn’t actually make a difference.

So, it is possible... but how unlikely is it? In order to answer this question we will once again turn to simulation. Remember that simulation was a method we used to estimate probabilities. In Chapter 1 you used coin flipping and a web-applet to simulate the p-value (the probability we would observe our data or something more extreme if the null hypothesis was true) for different hypothesis tests. Because the dolphin experiment is more complex than the studies we looked at in Chapter 1, we can’t simply flip a coin any longer. Instead, to do our simulation, we’re going to use playing cards, and then later we’ll use Fathom.

To estimate the p-value for this study you will need 30 playing cards of which 17 should be black and 13 should be red. The 17 black playing cards represent the “non-improvers.” The 13 red cards represent the “improvers.”

Practical Tip: If you don’t have playing cards, you can just use 30 slips of paper and write “I” for improver on 13 of them and “NI” for non-improver on the other 17. Instead of shuffling, just mix them up.

Your 30 cards represent the 30 depressed subjects in the study. You are assuming that 13 people (the red cards) will get better whether they swim with dolphins or not; they’ve improved simply because they’ve gotten to fly to Honduras and sit on the beach for a while. Thus, you are assuming the null hypothesis is true.
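If neither cards nor slips of paper are at hand, the same setup can be imitated in Python. This is only a sketch of one shuffle and deal; the variable names and the seed are assumptions for the example, and the activity itself is meant to be done by hand first.

```python
import random

# 13 red cards = improvers, 17 black cards = non-improvers (30 subjects in all).
deck = ["improver"] * 13 + ["non-improver"] * 17

rng = random.Random(15)  # arbitrary seed so the deal can be repeated
rng.shuffle(deck)

# Deal two stacks of fifteen: the first stack "swam with dolphins",
# the second stack did not.
dolphin_stack, no_dolphin_stack = deck[:15], deck[15:]

dolphin_improvers = dolphin_stack.count("improver")
no_dolphin_improvers = no_dolphin_stack.count("improver")
difference = 100 * (dolphin_improvers / 15 - no_dolphin_improvers / 15)

print(f"Swam with dolphins: {dolphin_improvers}/15 improved")
print(f"Did not swim:       {no_dolphin_improvers}/15 improved")
print(f"Difference in percentages: {difference:.2f}")
```

However the cards fall, the two stacks always contain 13 improvers between them, because under the null hypothesis the assignment to a stack cannot change who improves.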


c) Now shuffle your 30 cards and deal them, face down, into two stacks of fifteen. One of the two stacks represents the people who got to swim with dolphins and the other stack represents people who didn’t. Decide which stack is which and then fill in the table below.

Table 2.7a: Reduction in depression symptoms (random assignment)

                             Reduction in depression symptoms?
Swam with dolphins daily?    Yes             No              Total
Yes                          ___% (__/15)    ___% (__/15)    15
No                           ___% (__/15)    ___% (__/15)    15
Total                        13              17              30

Now, shuffle, deal and enter your data into the following tables (Tables 2.7b-e) so that you have done the simulation a total of 5 times.

Table 2.7b: Reduction in depression symptoms (random assignment)

                             Reduction in depression symptoms?
Swam with dolphins daily?    Yes             No              Total
Yes                          ___% (__/15)    ___% (__/15)    15
No                           ___% (__/15)    ___% (__/15)    15
Total                        13              17              30

Table 2.7c: Reduction in depression symptoms (random assignment)

                             Reduction in depression symptoms?
Swam with dolphins daily?    Yes             No              Total
Yes                          ___% (__/15)    ___% (__/15)    15
No                           ___% (__/15)    ___% (__/15)    15
Total                        13              17              30


Table 2.7d: Reduction in depression symptoms (random assignment)

                             Reduction in depression symptoms?
Swam with dolphins daily?    Yes             No              Total
Yes                          ___% (__/15)    ___% (__/15)    15
No                           ___% (__/15)    ___% (__/15)    15
Total                        13              17              30

Table 2.7e: Reduction in depression symptoms (random assignment)

                             Reduction in depression symptoms?
Swam with dolphins daily?    Yes             No              Total
Yes                          ___% (__/15)    ___% (__/15)    15
No                           ___% (__/15)    ___% (__/15)    15
Total                        13              17              30

Now, for each of the five tables (Tables 2.7a-e) enter the difference between the percent of people who had a reduction in depression symptoms among those who swam with dolphins and those who didn’t. Recall that in question 1(e) you calculated the same difference in percentages on the real data.

Table 2.8: Difference in percentages

From Table                   2.7a    2.7b    2.7c    2.7d    2.7e
Difference in percentages    ____    ____    ____    ____    ____

When you’re done, add your percentages to the dotplot in the front of the room. If many other groups in the class have not yet finished, skip ahead to questions (h), (i) and (j) while other groups are finishing.

d) Sketch the class’ dotplot in the space below. You could also sketch a histogram that corresponds to the dotplot.

e) How many times did the class get a difference in percentages at least as big as 46.67% (what we actually observed in the study)?

f) Use your answer in question (e) to find the p-value. Recall from Chapter 1 that the p-value is the probability that we would observe a measure of association at least as large as we did in our study, assuming the null hypothesis was true.


g) Based on the class results, what is your conclusion about the effectiveness of swimming with dolphins in improving people’s depression symptoms?

h) When you shuffle and deal your cards, you are simulating one of the two hypotheses (null and alternative) to be true. Which one are you simulating? Explain how you know you are correct.

i) Why do we always have 13 red cards (improvers)? Why don’t we ever change this when we are shuffling and dealing?

j) It’s possible that your answer to question (f) is zero (some classes will be zero). If your class got zero for a p-value, how could you modify the simulation to get a non-zero p-value? If your class got a non-zero p-value, explain how you could do the simulation again as a class and how the p-value could be zero the next time. What steps could you take to ensure you virtually always get a non-zero p-value?


Activity 2.2B: Finding a p-value using Fathom

In part 1 of this activity, question 2(j) demonstrated that since you were only obtaining the p-value using around 50 or so different shuffles you might get a p-value of 0 even though we know it is possible to shuffle the cards in such a way that we get a difference of at least 46.67% (and, hence, a non-zero p-value). Thus, to get a more accurate p-value we should shuffle the cards and run the simulation more times.
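The claim that such a shuffle is possible can be checked by counting deals directly. The sketch below is a side calculation and not the simulation approach this chapter uses; it assumes every way of choosing the 15 dolphin swimmers is equally likely, and the variable names are made up for the example.

```python
from math import comb

total_subjects, improvers, group_size = 30, 13, 15

# Number of equally likely ways to pick which 15 of the 30 subjects
# end up in the swim-with-dolphins group.
all_deals = comb(total_subjects, group_size)

# A difference of at least 46.67% means 10 or more of the 13 improvers
# land in the dolphin group (10 vs. 3, 11 vs. 2, 12 vs. 1, or 13 vs. 0).
extreme_deals = sum(
    comb(improvers, k) * comb(total_subjects - improvers, group_size - k)
    for k in range(10, improvers + 1)
)

probability = extreme_deals / all_deals
print(f"{extreme_deals} of {all_deals} deals, probability = {probability:.4f}")
```

The count of such deals is far from zero, but they make up only a small share of all possible deals, which is why a class doing about 50 shuffles may never see one.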

But, shuffling playing cards is not the most efficient way of computing the p-value for the study, so let’s turn to technology and use the statistical program Fathom to shuffle for us. Fathom will simulate the shuffling for us by following these three steps:

1. Randomly assign individuals to be either in the swim with dolphins group or not.

2. Each time compute the difference in the percentage of improvers in each group (this is the measure of association for our example).

3. Store the difference in percentages from each random assignment.

Do steps 1-3 many, many times.
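For readers curious about what these three steps look like outside Fathom, here is a minimal Python sketch. It is not Fathom's own implementation; the function name `simulate_differences`, the seed, and the choice of 1000 shufflings are assumptions for the example, and the estimated p-value will change slightly with a different seed.

```python
import random

def simulate_differences(n_shuffles=1000, seed=2005):
    """Return the difference in percentages from each random assignment."""
    rng = random.Random(seed)
    # 13 improvers and 17 non-improvers, fixed because the null is assumed true.
    outcomes = ["improved"] * 13 + ["not improved"] * 17
    differences = []
    for _ in range(n_shuffles):
        # Step 1: randomly assign 15 of the 30 individuals to swim with dolphins.
        rng.shuffle(outcomes)
        dolphin_group, other_group = outcomes[:15], outcomes[15:]
        # Step 2: difference in the percentage of improvers between the groups.
        diff = 100 * (dolphin_group.count("improved")
                      - other_group.count("improved")) / 15
        # Step 3: store the difference from this random assignment.
        differences.append(diff)
    return differences

observed = 100 * (10 - 3) / 15  # difference in percentages from Table 2.6
differences = simulate_differences()
p_value = sum(d >= observed for d in differences) / len(differences)
print(f"estimated p-value from {len(differences)} shufflings: {p_value:.3f}")
```

Raising `n_shuffles` makes the estimate steadier, at the cost of a longer run, which is the same trade-off as shuffling the cards more times by hand.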

Use Fathom to do 1000 shufflings. To do this, use the detailed Fathom instructions given on the following pages. Look at the distribution (e.g. dotplot or histogram) of difference in percentages from all of the shufflings and answer the following questions:

1. What is the center of the distribution?

2. What is its shape?

3. Is the actual difference in percentages near the middle or near the tail of the distribution?

4. Compute the p-value.

5. What does your p-value tell you about the effect of swimming with dolphins on reducing depression symptoms?

6. How does the p-value you got from Fathom compare to the one that you got in class? Which one do you think is more accurate and why?


Using Fathom to do 1000 scramblings

1. Obtain the Dolphins.ftm file from the textbook website.

2. Open Fathom from the Programs menu.

3. Use File>Open to open the Dolphins.ftm file.

4. You will see a small box called “Dolphins” sitting in a large white space. This is the Dolphins dataset. Its icon looks like this:

[Icon: Dolphins]

5. Click once on the Dolphins dataset to select it. Then, “drag down” a Table from the toolbar. You should see something like what is below.

[Fathom case table titled “Dolphins” with columns SwimDo..., Improv..., and <new>. Rows 1-15 are visible: every visible row has Y in the SwimDo... column, and the Improv... column shows 10 Y values and 5 N values.]

This is the data set for the dolphins experiment. Notice there are 30 rows (scroll down to see them all) and two columns. Each row is a person. The first column indicates whether or not the person swam with dolphins, and the second column indicates whether the person’s depression symptoms improved.

6. Now, drag down a “Summary” from the toolbar. Notice you can drop attributes (Fathom’s name for variables) to create rows or columns. Drop “SwimDolphins” (the explanatory variable) to create rows and “ImprovedDepression” to create columns (recall the Key Idea earlier about how to create cross-tabulation tables). You will now have a cross-tabulation table that looks similar to the one we have been using, except the order of the rows and columns is different. The order of the rows and columns doesn’t matter (Note: Fathom has put them in alphabetical order, N before Y).

7. Now, double click on the Dolphins collection to reveal the “Inspector.” Click on the Measures tab and you will see we have computed the difference in percentages of reduced depression between people who swam with dolphins and those who didn’t (46.67%).


8. In order to have Fathom “shuffle and deal” (act as if the null hypothesis is true), click on the Dolphins collection. Then go to Collection>Scramble Attribute values. Notice that a new collection was created. In this new collection, “Scrambled Dolphins,” Fathom has scrambled (shuffled) the SwimDolphins column: the equivalent of shuffling and dealing two stacks like you did.

9. Click on the Scrambled Dolphins collection and drag down a table to look at the scrambled dolphins spreadsheet. Compare it with the spreadsheet for the Dolphins collection. You should notice that the “Improvers” column is the same, but the SwimDolphins column is different. This is because we’ve scrambled up the SwimDolphins column.

10. Now, create a cross-tabulation table of the Scrambled Dolphins collection like you did for the real Dolphins dataset in step #6 above. You can then see how many improvers end up in the swim with dolphins group randomly. You can also double click on the scrambled dolphins collection to reveal the inspector and see what the difference in percentages is between the two groups. Confirm that the numbers in your cross-tabulation table match what you saw for a difference in percentages and that you understand how the difference in percentages was obtained.

11. If we had to do this every time it would still be a bit tedious, but we can automate much of the process. Click on the Scrambled Dolphins collection again. Now go to Collection>Collect Measures on the top toolbar. Click in the box next to “Animation On” to turn animation off. Click in the box next to “Replace existing cases” and then put 1000 in the box next to measures. Click “Collect More Measures” and allow time for Fathom to do the scrambling.

Key Idea: Step 11 is asking Fathom to do 1000 shuffles and, each time, find the difference in the percentages of improvers between the two shuffled groups. It is “automating” the shuffling.

12. Now drag down a table to “look at” the Measures from Scrambled Dolphins collection (by clicking on the collection and dragging the table).

13. There is one column of data. Each number represents the difference in percentages of improvers. So, we’ve simulated 1000 shuffle and deals!

14. You can get a dotplot of the data by dragging down a graph from the toolbar and dropping the diffprops variable where it says “drop attribute here.” You should see a picture similar to what we got as a class when we were shuffling cards except it will be more filled in.

15. To get a p-value, right click on the diffprops column and select “SortDescending.” This sorts the diffprops column. You can then count how many random differences (out of 1000) are at least as extreme as the


Recap of Activity 2.2B

Like the criminal justice system, we assumed innocence (H0: swimming with dolphins doesn’t impact depression symptoms) in order to prove guilt (Ha: swimming with dolphins does improve depression symptoms). To assess the evidence (study data) we assumed that swimming with dolphins didn’t impact depression symptoms (null hypothesis is true), and so we randomly split subjects between the swim with dolphins and not swim with dolphins groups many, many times to see how many “improvers” would occur in the swim with dolphins group by chance. When we saw that our result (46.67% more improvers in the swim with dolphins group) was unlikely to occur by chance, we concluded that swimming with dolphins improves people’s depression symptoms. This is the same as assuming that someone is innocent and then, after looking at the evidence, saying “There’s no way we would see this evidence if they were really innocent.” For example, the jury assumes someone is innocent until they hear that the DNA of the defendant was at the scene of the crime, there was an eyewitness, and the defendant had a motive for the crime. Then they conclude, “There’s no way we would have evidence from DNA, an eyewitness and a motive if the defendant were innocent. Thus, the defendant is guilty.”

Shuffling? Scrambling? Does it matter?

In Activity 2.2A you shuffled and dealt cards into two stacks (one for those who swam with dolphins and one for those who didn’t). In Activity 2.2B you used Fathom to scramble the explanatory variable. Does it matter which way you do it? In short, the answer is no. The goal of the shuffling/scrambling step is to simulate the null hypothesis. The null hypothesis says that there is no relationship between the explanatory and response variables. In the examples we’ve looked at in this chapter, the null hypothesis means that there is no difference in the proportion of people who get better when swimming with dolphins compared to those who get better without swimming with dolphins. In order to simulate the null hypothesis we need to do something to illustrate what would happen if there were no relationship between the explanatory and response variables. Both shuffling and dealing cards and scrambling the explanatory variable illustrate the null hypothesis. Thus, either method can be used.


Section 2.3: Study Designs and Their Impact on Conclusions

Think about the different conclusions you can make in the norovirus study compared to the dolphins study. In both cases we compare percentages from two groups. In both cases we see that the p-value is small enough to reject the null hypothesis. The difference is that, in the dolphins study, we can conclude that swimming with dolphins causes a reduction in depression symptoms, while for the norovirus study our conclusion was that we have evidence that male Hope students (all of them, not just our sample) were more likely to contract norovirus than female Hope students, but not that being male caused someone to contract norovirus. Why the difference?

The dolphin study is called an experiment because the researchers assigned individuals in the study (subjects) to different experimental conditions, called treatments. The dolphin study had two treatments: “swimming with dolphins” and “not swimming with dolphins.” The researchers decided (randomly) which subjects would swim with dolphins and which wouldn’t, but the key idea for right now is that the individuals in the study didn’t get to decide; they were randomly assigned.

The norovirus study was an observational study. In an observational study, a researcher measures variables on the individuals in the study, but does not assign them to different conditions. In other words, the researcher is not actively intervening to change the situation for the individuals in the study. Since the norovirus study was based on a survey of who contracted the norovirus and their gender, it is an observational study.

So, how does the design of the study (experiment vs. observational) impact the kind of conclusions that can be drawn? There are two key components of study designs that impact the conclusions that can be drawn. The first is how the individuals in the study have been assigned to the groups being compared and the second is how the individuals in the study were selected to participate.

Allocation of individuals to groups

When considering the conclusions you can make from your study, you need to consider whether the individuals in the study were randomly allocated to the different groups. For example, in the dolphin study, subjects were randomly assigned to swim with dolphins or not, whereas in the norovirus study, gender was not randomly assigned (individuals entered the study with it!). So, how does this impact the conclusions that can be made? Consider the norovirus study. We found that males were more likely to contract the norovirus. Notice, we did not conclude that being male CAUSED you to contract norovirus. Most likely it’s not someone’s “maleness” that caused an increase in risk of getting norovirus. Instead, it’s most likely that, since hand-washing can help prevent the spread of norovirus and males tend to wash their hands less, we see that males were more likely to get norovirus. Of course, we can’t really even say this with any finality; in fact it could be any other variable associated with gender, like getting less sleep or eating less healthily, or, though unlikely, it could be maleness itself. In this case hand-washing, or any of the other variables besides gender that we just mentioned, is called a lurking (or confounding) variable. Lurking variables are variables that are associated with both the explanatory and response variables. Thus, in the norovirus study, the lurking variable hand-washing may be a better explanation for what caused norovirus. This relationship manifested itself in our data as an increased risk of norovirus for males.

Key Idea: In a well designed experiment, researchers randomly assign subjects to treatment conditions. This is not the case in an observational study.

Contrast this situation with the dolphin example. In the dolphin study, subjects were randomly assigned to the swim with dolphins and not swim with dolphins groups. Now, let’s assume that genetics is a potential lurking variable in this study. That is, let’s assume there is some gene (call it the “sunshine” gene), and if you have it then you will get better simply by going to Honduras (regardless of swimming with dolphins). Now imagine that the researchers don’t know who has the sunshine gene and who doesn’t. The problem is, what if all the people with the sunshine gene end up in the swim with dolphins group? Then, how would we know if it’s swimming with dolphins or the sunshine gene that is causing people to get better? We wouldn’t! In this case, the sunshine gene is a lurking variable. So what can we do about the sunshine gene? If we randomly assign participants to the two treatment groups then there will typically be roughly the same number of people with the sunshine gene in both the swim with dolphins and not swim with dolphins groups.¹

How does this impact the conclusions that can be made from an experiment? The sunshine gene is only one of countless possible lurking variables. By randomly assigning subjects to treatment groups you have done your best to ensure that for all possible values of every lurking variable the treatment groups “look” approximately the same (e.g. approximately equal numbers of people with and without the sunshine gene in both treatment groups).
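The claim that random assignment tends to balance a lurking variable can be checked with a small simulation. The sketch below is illustrative only: the number of sunshine-gene carriers (10 of 30 subjects) is a made-up assumption, and the function name and seed are hypothetical.

```python
import random

def sunshine_counts(has_gene, rng):
    """Randomly split subjects into two equal groups; count carriers in each."""
    subjects = list(has_gene)
    rng.shuffle(subjects)
    half = len(subjects) // 2
    return sum(subjects[:half]), sum(subjects[half:])

rng = random.Random(2)

# HYPOTHETICAL: 10 of the 30 subjects carry the sunshine gene (True).
has_gene = [True] * 10 + [False] * 20

# Repeat the random assignment many times and record how unequal the groups are.
gaps = []
for _ in range(1000):
    dolphin, control = sunshine_counts(has_gene, rng)
    gaps.append(abs(dolphin - control))

print("average gap in carriers between groups:", sum(gaps) / len(gaps))
print("assignments with every carrier in one group:", sum(g == 10 for g in gaps))
```

Under these assumptions the two groups usually differ by only a few carriers, and an assignment that puts every carrier in one group is very rare.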

In an observational study there is no guarantee the treatment groups have similar values of lurking variables. In an observational study, there are always more lurking variables out there that could be explaining the relationship between the two variables you are investigating.

So, in a well-designed and well-conducted experiment, the random assignment of subjects to treatments balances lurking variables across the groups. Cause-effect conclusions are therefore possible since, apart from chance, the only explanation for a difference between the groups is the treatments themselves. This is not the case in an observational study, where there are many possible lurking variables that could explain the data.

¹ Clearly, if you don’t know who has the sunshine gene and who doesn’t, random assignment is the best option. But, what if you do know who has the sunshine gene? In this case, it would be better to make sure that both treatment groups have an equal number of individuals with and without the sunshine gene. The statistical term for this technique is blocking. Specifically, blocking means ensuring treatment groups are balanced on measured lurking variables.
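Blocking as described in this footnote can be sketched as random assignment carried out separately within each block. The sketch below is one possible illustration, not a prescribed procedure. The subject IDs, the rule that the first 10 subjects carry the gene, and the function name are all hypothetical.

```python
import random

def blocked_assignment(subjects, block_of, rng):
    """Randomly assign half of each block to 'dolphin' and half to 'control'."""
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)              # randomize order within the block
        half = len(members) // 2
        for s in members[:half]:
            assignment[s] = "dolphin"
        for s in members[half:]:
            assignment[s] = "control"
    return assignment

rng = random.Random(5)
subjects = list(range(30))
# HYPOTHETICAL: subjects 0-9 are known carriers of the sunshine gene.
has_gene = {s: s < 10 for s in subjects}

assignment = blocked_assignment(subjects, has_gene, rng)
carriers_dolphin = sum(1 for s in subjects if has_gene[s] and assignment[s] == "dolphin")
carriers_control = sum(1 for s in subjects if has_gene[s] and assignment[s] == "control")
print(carriers_dolphin, carriers_control)  # 5 5
```

Because each block here has an even number of subjects, both treatment groups receive exactly the same number of carriers and non-carriers, while who ends up in which group is still random.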

Key Idea: You cannot draw cause-effect conclusions from observational studies because of potential lurking variables. Lurking variables are controlled in experiments due to the random allocation of subjects to treatment groups. Thus, cause-effect conclusions are possible in experiments.


Selection of individuals for your study

When considering the conclusions you can make from your study, you also need to consider whether the individuals in the study have been drawn randomly from the population. If the individuals in the study have been selected randomly, then the conclusion made in the statistical analysis is applicable to the population. If the individuals in the study have not been selected randomly from the population, then the sample is a convenience sample and conclusions to the population are more difficult to make (see the next section for further discussion).

How do we ensure that our sample has been taken randomly? What does “randomly” mean? The most common method to ensure a sample is random is to take a simple random sample (or SRS). To take a simple random sample you first make a list of all members of the entire population of interest. You then use computer software to randomly choose people from that list. When using a computer may be inconvenient, people will take a systematic random sample. Examples of taking systematic random samples are taking every 10th person in the list, or taking the person at the top of each page of the phone book, etc. Other common types of random samples are stratified random samples and cluster random samples. See the exercises for more discussion and examples of these types of random samples.

Participants for either experiments or observational studies can be obtained by random sampling, but random sampling is rarely used for experiments. Instead, most experiments use convenience samples. In fact, it is very difficult even for an observational study to have a truly random sample.

Challenges in obtaining random samples

There are numerous challenges when trying to obtain a random sample.

One alternative to a random sample is to use a convenience sample. A convenience sample is any non-random method of including participants in the study. For example, if I conduct a survey of Hope students by handing out a survey to people in Phelps’ dining hall, this is a convenience sample. The problem with a convenience sample is there is no guarantee the sample is representative of the population.

Researchers who use convenience samples instead of random samples attempt to argue why their sample is representative of the population. A sample is representative of the population if the sample contains similar values of all variables (measured or lurking) that could be impacting the results of the study. A simple random sample (and other random sampling methods) uses probability to ensure the sample is representative. Convenience samples do not have the benefit of a strong mathematical argument (i.e. probability) to ensure their representativeness. Instead it is up to the researcher to make a convincing case as to the representativeness of the sample.
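The idea that probability makes a simple random sample representative can be illustrated with a short simulation. Everything in the sketch below is hypothetical: a population list of 3000 students, 60% of whom have some trait, and an extreme convenience sample taken from the top of a list that happens to be sorted by that trait.

```python
import random

rng = random.Random(4)

# HYPOTHETICAL population: 3000 students, 1 marks a trait held by 60% of them.
population = [1] * 1800 + [0] * 1200

# Draw many simple random samples of 100 students (without replacement)
# and record how many in each sample have the trait.
sample_counts = []
for _ in range(500):
    sample = rng.sample(population, k=100)
    sample_counts.append(sum(sample))

# Share of samples whose percentage is within 10 points of the true 60%.
share_close = sum(abs(c - 60) <= 10 for c in sample_counts) / len(sample_counts)
mean_percent = sum(sample_counts) / len(sample_counts)

# An extreme convenience sample: the first 100 names on the sorted list.
convenience_percent = sum(population[:100])

print("average SRS percentage with the trait:", mean_percent)
print("share of SRS results within 10 points of 60%:", share_close)
print("convenience sample percentage with the trait:", convenience_percent)
```

Under these assumptions nearly every simple random sample lands close to the population value, while the convenience sample can miss it badly. This is the probability argument that a convenience sample lacks.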

Consider again the dolphins experiment. In this experiment the 30 subjects were not obtained by way of a simple random sample or other random sampling method; instead these were volunteers and, thus, form a convenience sample. What if the subjects tended to be older individuals? What if the subjects were all females? How does this impact the conclusions from the study? Remember, the study showed fairly conclusive evidence that swimming with dolphins improved depression symptoms. Clearly, however, we cannot conclude that dolphin therapy works for all individuals, but only for the individuals in the sample. The extent to which swimming with dolphins will help all individuals with depression depends on how representative the sample is of all individuals with depression.

Key Idea: In a convenience sample, there is no guarantee the sample is representative of the population.

Obtaining random samples is also hard because of undercoverage and non-response bias. Undercoverage occurs when you do not have a complete list of everyone in the population of interest. Non-response bias occurs when you randomly select individuals to participate in your study, but they refuse/decline to participate. A related issue is response bias, which occurs when people’s answers to questions do not reflect the truth. See the exercises for more discussion and consideration of the implications of undercoverage, non-response and response bias.

Anecdotal Evidence

Contrast observational and experimental study designs to anecdotal evidence.

Anecdotal evidence is evidence based on a few personal observations or “anecdotes.” For example, you observe that your grandfather was a chronic smoker, ate lots of greasy food and never exercised, yet he lived to be 95 and was “never sick a day in his life.” If you make lifestyle decisions based on your observations of your grandfather you would be making decisions based on anecdotal evidence. Clearly, anecdotal evidence does not offer the benefits of generalizing a sample to a population or the benefits of cause-effect conclusions, but many people use anecdotal evidence to make decisions, whether they realize it or not!


Activity 2.3: More with Fathom

Let’s once again look at the results of the first day of class statistics student survey available in class_survey.ftm. Now we will revisit the results from this survey using tests of significance. As part of the survey, students were asked their gender, whether they had ever purchased textbooks online and whether they had ever been in a car accident while driving. In this activity we will explore whether purchasing textbooks online and being in a car accident while driving are associated with gender. Your instructor will provide you with the Fathom dataset needed to complete the following activity.

1. Let’s first explore whether there is a relationship between gender and buying textbooks online. Note: Activity 2.2B, earlier in this chapter, has specific Fathom instructions for creating cross-tabulation tables and scrambling, so the instructions here assume you have read and understood them.

a) Is this an observational study or an experiment? Why?

b) Assume that the population of interest is all Hope students. Do you have a random sample of the population? Do you think the sample is representative of the population? Why or why not?

c) Fill in the 2x2 cross-tabulation table below.

              Ever bought a textbook online?
Gender        Yes              No               Total
Female        % (   /   )      % (   /   )
Male          % (   /   )      % (   /   )
Total

d) What is the difference (female minus male) in percentages in your sample? Based on this, do you think that there is a difference, in the population, in the percentage of all Hope males and all Hope females who have ever bought a textbook online?


e) Explain what impact the lack of representativeness of your sample may have on your conclusion in (d). Specifically, how might the percent of females who’ve ever bought a textbook online be different in our sample than in the population? Why? How might the percent of males who’ve ever bought a textbook online be different in our sample than in the population? Why? What might be the impact on the difference in percentages? Why?

One-sided vs. two-sided alternative hypotheses

At this point in the course, we’ve only considered tests of significance with a one-sided alternative hypothesis. Consider the swimming with dolphins study. The alternative hypothesis was:

Ha: People who swim with dolphins are more likely to improve on their depression symptoms

This is a one-sided alternative hypothesis because we are only looking for people who swim with dolphins to be more likely to improve on their depression symptoms. The two-sided version of this alternative hypothesis would be:

Ha: People who swim with dolphins will improve on their depression symptoms at a different proportion than people who don’t swim with dolphins

In other words we’re just looking for any effect of swimming with dolphins on depression symptoms: positive or negative.

Key Idea: A two-sided alternative hypothesis looks for differences that are in either direction from the null hypothesis.

How do you carry out a two-sided hypothesis test?

Until now, when we found the p-value for a one-sided test, we looked at the number of outcomes that were as extreme or more extreme than what we observed. For example, in the dolphin study we looked to see how many times we observed a difference in proportions of at least 47% between the two groups when we scrambled. For a two-sided test we also look at the opposite “tail” of the scramble distribution, an equal distance away from the middle. For the dolphin study (and, in general, most studies), the middle of the scramble distribution is at zero, and so we would count not only how many times we got at least 47% but also how many times we got less than or equal to -47%.
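The two-tail count described above can be sketched in a few lines. The sketch assumes the scramble distribution is centered at zero, as the text states for the dolphin study. The list of differences is a tiny made-up example rather than output from Fathom, and the function name is hypothetical.

```python
def two_sided_p_value(diffs, observed, tol=1e-9):
    """Share of scrambled differences at least as far from zero as `observed`.

    Assumes the scramble distribution is centered at zero. The tolerance
    guards against floating-point round-off in the comparisons.
    """
    cutoff = abs(observed)
    upper_tail = sum(d >= cutoff - tol for d in diffs)    # e.g. at least 47%
    lower_tail = sum(d <= -cutoff + tol for d in diffs)   # e.g. -47% or less
    return (upper_tail + lower_tail) / len(diffs)

# MADE-UP scrambled differences, for illustration only.
diffs = [-0.60, -0.47, -0.20, -0.07, 0.00, 0.07, 0.20, 0.33, 0.47, 0.60]
print(two_sided_p_value(diffs, observed=0.47))  # 0.4
```

In this toy list, two values fall in each tail, so the two-sided p-value is 4/10. With real scrambles the same function would be applied to the full column of collected differences.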


Figure 2.2 shows that we are looking at both tails of the distribution when we compute the p-value. In this case, there are 13 times when we get a difference of 47% or greater and 15 times when we get a difference of -47% or less. Thus, the p-value is (13 + 15)/1000 = 28/1000 = 0.028. Because scramble distributions are typically symmetric, the two-sided p-value is generally about twice as big as the p-value from the one-sided test.

Figure 2.2. Illustrating a two-sided p-value for the Dolphins study

[Histogram: Measures from Scrambled Dolphins; horizontal axis diffprops, from -0.8 to 0.8]

Why would you use a two-sided test instead of one-sided?

Two-sided tests are used more commonly in practice than one-sided tests. One reason for this is that two-sided tests are more conservative than one-sided tests. In other words, we typically get p-values that are about twice as large with a two-sided test and, thus, we have less evidence to reject the null hypothesis. Since it is harder to reject the null hypothesis, we say two-sided tests are conservative. Another reason for using two-sided tests is so that the researcher is not biased toward a particular result. For example, the researchers in the toy study could have used a two-sided test in case it turned out that, in fact, children preferred the hinderer toy. Because of issues with Type I and Type II errors (see Chapter 3), the decision to use a one- or two-sided alternative should be made prior to analyzing the data and based on a theoretical rationale and prior research.

f) State the null and alternative hypotheses comparing the proportion of males and females who have ever bought a textbook online. Use a two-sided alternative hypothesis. What does it mean to use a two-sided instead of a one-sided alternative?


g) Use the Fathom instructions above to make a dotplot of the difference in percentages from 1000 scrambles. What is the shape of the dotplot of the difference in percentages? What is its center? Why does it make sense for the center to be where it is?

Creating a measure in order to find a p-value using Fathom

In the dolphins activity, we had created a measure already. But, in general, this is something you will need to do yourself.

1. You need to create the appropriate measure for the hypothesis test (measure is the term Fathom uses to describe a summary statistic). In this case, the measure we are interested in is the difference in the percentages of students who have ever bought a textbook online, comparing females to males.

2. To create this measure in Fathom, double click on your collection to open the Inspector.

3. Click on the Measures tab. In the Measure column click on <new> and name your new measure “percentfemale” (for percentage of females who bought textbooks online.)

4. Double Click in the “Formula” box.

TECHNICAL NOTES: You probably want to drag the edges of your formula dialog box to make the window bigger; otherwise you will run off the side of the screen. Also, Fathom will automatically add closing parentheses and the second set of quotation marks, so type slowly and carefully!

5. Type the following:

proportion(text="Yes", gender="Female")

Then Click OK. You should see the percentage of females who bought textbooks online appear in the “value” box. Confirm that this matches what you got in part c.

6. Do steps 3-5 again, but now for the males, creating a new measure and naming it “percentmale.”

7. Create a third measure which will be the difference in the two percentages. Name the measure “percentdiff” and for the formula type: percentfemale-percentmale. Again, confirm that the value you get corresponds to your answer to part d.

8. Now that your measure is created you can follow the directions given in Activity 2.2B to scramble the dataset and collect measures. Use 1000 scrambles.
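For readers who want to see what these three measures compute, the steps can be mirrored in Python. This is only a sketch: the `proportion` helper below is a hypothetical stand-in that follows the text's description of the Fathom formula (the share of cases meeting the first condition among those meeting the second), and the eight survey rows are made up for illustration.

```python
def proportion(rows, condition, within):
    """Fraction of rows meeting `condition` among the rows meeting `within`."""
    subset = [r for r in rows if within(r)]
    return sum(1 for r in subset if condition(r)) / len(subset)

# HYPOTHETICAL survey rows; the real class survey data are not shown here.
survey = [
    {"gender": "Female", "text": "Yes"},
    {"gender": "Female", "text": "Yes"},
    {"gender": "Female", "text": "Yes"},
    {"gender": "Female", "text": "No"},
    {"gender": "Male", "text": "Yes"},
    {"gender": "Male", "text": "Yes"},
    {"gender": "Male", "text": "No"},
    {"gender": "Male", "text": "No"},
]

def bought_online(row):
    return row["text"] == "Yes"

# The three measures, named as in the steps above.
percentfemale = proportion(survey, bought_online, lambda r: r["gender"] == "Female")
percentmale = proportion(survey, bought_online, lambda r: r["gender"] == "Male")
percentdiff = percentfemale - percentmale

print(percentfemale, percentmale, percentdiff)  # 0.75 0.5 0.25
```

Scrambling would then amount to shuffling the gender values across rows and recomputing `percentdiff` each time.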


h) What is the p-value for your test?

i) What is your conclusion about whether males or females have bought textbooks online in different proportions? Make sure you relate your conclusion back to the population of interest (see your answer to part b).

2. Now, let’s explore whether there is a relationship between gender and getting in a car accident while driving.

a) Fill in the 2x2 cross-tabulation table below.

              Ever been in a car accident while driving?
Gender        Yes              No               Total
Female        % (   /   )      % (   /   )
Male          % (   /   )      % (   /   )
Total

b) What is the difference in percentages in the sample? Based on this, do you think that there is a difference, in the population, in the percentage of all Hope males and all Hope females who have ever been in a car accident while driving?

c) State the null and alternative hypotheses. Use a two-sided alternative hypothesis.

d) Create an appropriate measure and scramble the data. Collect measures and then create a dotplot of the difference in percentages. What is its center? Why does it make sense for the center to be where it is? Note: You’ll need to create measures and scramble to answer this question.


f) What is your conclusion about whether different percentages of males and females have gotten in a car accident while driving? Make sure you relate your conclusion back to the population of interest.

g) Explain what impact the lack of representativeness of your sample may have on your conclusion in (d). Specifically, how might the percent of females who’ve ever gotten in a car accident be different in our sample than in the population? Why? How might the percent of males who’ve ever gotten in a car accident be different in our sample than in the population? Why? What might be the impact on the difference in percentages? Why?


Activity 2.3: Car Seats vs Seatbelts

Instructions: Watch the 2005 video by Steven Levitt from ted.com (available on the textbook website) and then answer the following questions. If you’re interested in looking at other aspects of the world we live in from the view of Steven Levitt, read his book, Freakonomics.

Table 2.9: Raw Data from Fatal Car Crashes

1. a) What kind of data is presented in Table 2.9?

a. Observational

b. Experimental

c. Anecdotal

d. None of the above

b) When presenting Table 2.9, Steven Levitt says, “The theory tells you that the lap-only seat belt must be worse than the lap+shoulder belt. And this only reminds you that when dealing with raw data there are hundreds of confounding variables that get in the way.” How does the type of study explain Levitt’s statement?


1. Look at Figure 2.3 below.

Figure 2.3

a) The heights of the three bars for “No controls” can be obtained directly from Table 2.9. Explain how. Hint: Notice what is on the y-axis in Figure 2.3.

b) Using your answer to part (a) and the units on the y-axis, explain why a “higher bar is good.”

c) Why do you think lap only belts appear to fare better than car seats and lap+shoulder belts in the “no controls” set of bars?


In Math 210, we will be exploring statistical methods of analyzing up to two variables simultaneously, but in the follow-up course (Math 312) we consider methods of handling more than two variables simultaneously. One of the most powerful and common techniques of analyzing more than two variables simultaneously involves controlling confounding/lurking variables in the statistical analysis. In short, controlling for other variables means accounting for confounding/lurking variables in such a way as to take them out of consideration in the analysis. For example, in the norovirus example earlier in this chapter we argued that, perhaps, the reason why males appeared to have a higher prevalence of norovirus was due to less hand washing. Figure 2.4 shows that gender may impact hand washing, which in turn impacts norovirus (the arrows mean “impacts,” with response variables being pointed to by explanatory variables). The question is, once we account for (or control for) hand washing by drawing arrows #1 and #2, is there still an arrow between gender and norovirus (#3)? In other words, once we account for the fact that males wash their hands less, is there still a relationship between being male and getting norovirus?

Figure 2.4

[Diagram: Gender, Hand Washing, and Norovirus connected by arrows numbered 1, 2, and 3]

2. In Figure 2.3, Levitt first shows a graph with “basic controls,” that is, what the difference in death rates is after controlling for a number of variables like the age of the child, how hard the crash was, and the seat the child was sitting in. The “extensive controls” bars indicate the difference in death rates when controlling for up to 100 different confounding/lurking variables in the analysis. Look again at Figure 2.3 and explain what happens when you compare lap+shoulder belts to car seats to lap only belts as you control for more and more confounding variables (extensive controls).


Figure 2.5

3. Figure 2.5 gives results similar to those in Figure 2.3, but for different types of crashes. Based on Figure 2.5:

a) Compare car seats to lap+shoulder belts for the three different types of crashes.

b) What reason does Levitt give for why car seats fare well in the frontal impact crashes, but not as well in other types of crashes?

4. Levitt says most people think the government wouldn’t have told us to use car seats if they didn’t work. Levitt argues, however, that the reason for passing the car seat laws is mainly based on the impassioned pleas of a few parents whose children died in car accidents. This is an example of a decision being made based on:

a. Statistical inference from an observational study

b. A cause and effect conclusion from an experiment

c. Anecdotal evidence

5. The crash tests commissioned by Levitt are an example of:

a. An observational study
b. An experiment
c. Anecdotal evidence
d. None of the above

6. Levitt argues that economists would think to look at the real-world crashes while “scientists” would rather go to the laboratory. Argue the pros and cons of both approaches.

7. Near the end of the video Levitt uses a story about “placebo pills” as an analogy for whether we will see integrated car seats in the future. Explain what the “placebo effect” is for an experimental study and why it argues for the use of a placebo-control group in many experiments. (Use Wikipedia or other resources to answer this question, if necessary.)

8. At the end of the video, Levitt is asked a question about the ability of car seats to prevent injury. Levitt answers in three parts. First, in his data the differences between car seats and lap+shoulder belts are insignificant. Second, in data on New Jersey crashes, including non-fatal accidents, there are small differences (a 10% reduction), but only for minor injuries. Third, there is another way of gathering data, based on phone calls to people involved in accidents (as opposed to data reported through government agencies), that shows very large benefits of car seats over lap+shoulder belts.

a) How does hearing this discussion of injuries change (or not) your reaction to Levitt’s presentation? How does it impact your thoughts on the need to be a careful consumer of statistical information?

b) Name some reasons why the alternative method of gathering data (phone calls to individuals in the crash) might look so different from Levitt’s data.

Chapter Summary

Spreadsheets are an integral part of every statistical package. They display characteristics measured or observed on individuals. Each row represents an individual; thus the number of rows is the number of individuals in the data set. Each column represents a variable; thus the number of columns is the number of variables in the data set. We will use Fathom and PASW as statistical packages in this course.

Individuals are the people or objects we are interested in; in experiments, they are often called subjects. Variables are characteristics measured or observed on individuals.

Categorical variables place individuals into categories. Quantitative variables take numerical values for which arithmetic operations make sense. In other words, you can add, subtract, multiply and divide the values and the answers are meaningful.

Often when exploring relationships between two variables you can identify both an explanatory variable and a response variable. The explanatory variable is the variable that we think is explaining the change in the response variable.

We looked at how to use scrambling and Fathom to compare percentages from two different groups. The null hypothesis is that the percentages are the same in both groups, and the alternative hypothesis is that the percentages are different in the two groups. If we assume the null hypothesis is true, then we can scramble the explanatory variable many, many times and, each time, find the difference in the percentages of the two groups. We can then see how unusual the difference observed in our original sample is compared with the differences produced by scrambling (that is, the differences we would expect if the null hypothesis were true).
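The scrambling logic summarized above can also be sketched outside Fathom. The Python sketch below is illustrative only: the data are hypothetical (made-up counts, not from any study in this chapter), the function names are our own, and it assumes a two-sided alternative, counting a scrambled difference as "as extreme as" the observed one when it is at least as large in absolute value.

```python
import random

def diff_in_proportions(groups, responses):
    """Proportion of successes in group 1 minus proportion in group 0."""
    g1 = [r for g, r in zip(groups, responses) if g == 1]
    g0 = [r for g, r in zip(groups, responses) if g == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)

def scramble_test(groups, responses, n_scrambles=5000, seed=1):
    """Estimate a two-sided p-value by scrambling the explanatory variable."""
    rng = random.Random(seed)
    observed = diff_in_proportions(groups, responses)
    scrambled = list(groups)
    as_extreme = 0
    for _ in range(n_scrambles):
        rng.shuffle(scrambled)  # break any link between group and response
        diff = diff_in_proportions(scrambled, responses)
        # small tolerance guards against floating-point rounding in ties
        if abs(diff) >= abs(observed) - 1e-12:
            as_extreme += 1
    return observed, as_extreme / n_scrambles

# Hypothetical data: group 1 has 12 successes out of 20, group 0 has 5 out of 20.
groups = [1] * 20 + [0] * 20
responses = [1] * 12 + [0] * 8 + [1] * 5 + [0] * 15

observed, p_value = scramble_test(groups, responses)
print("Observed difference in proportions:", observed)
print("Estimated two-sided p-value:", p_value)
```

The estimated p-value is the fraction of scrambles that produce a difference at least as extreme as the one observed. It is a simulation estimate, so it changes slightly with the seed and the number of scrambles.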

The way the sample is gathered and the type of study performed change the types of conclusions that are possible. In experiments, researchers randomly assign subjects to treatment groups, which tends to balance lurking variables across the groups and, thus, makes cause-and-effect conclusions possible. In observational studies, lurking variables could be explaining the observed relationship between variables, and thus cause-and-effect conclusions are not possible. Furthermore, when a sample is randomly selected from the population using a probabilistic technique (e.g. simple random sampling), the sample is typically representative of the population. When a sample is representative of the population, conclusions about the sample can be generalized to the population. There is no such assurance with convenience sampling.
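A short simulation can illustrate why random assignment protects against lurking variables. The Python sketch below is a hypothetical illustration, not part of the course software: it invents a lurking variable carried by 80 of 200 subjects, randomly splits the subjects into two groups of 100 many times, and records how different the two groups are on that variable.

```python
import random

rng = random.Random(2)

# Hypothetical lurking variable: 1 = subject has the trait, 0 = does not.
# The 80-out-of-200 split is made up for illustration.
lurking = [1] * 80 + [0] * 120

diffs = []
for _ in range(2000):
    order = lurking[:]
    rng.shuffle(order)  # random assignment: first 100 to treatment, rest to control
    treatment, control = order[:100], order[100:]
    diffs.append(sum(treatment) / 100 - sum(control) / 100)

mean_diff = sum(diffs) / len(diffs)
largest_gap = max(abs(d) for d in diffs)
print("Average difference between groups:", mean_diff)
print("Largest difference seen in any one assignment:", largest_gap)
```

Across many random assignments the average difference between the groups is close to zero, so neither group is systematically favored. Any single assignment can still show some imbalance by chance, which is one reason tests of significance are still needed in experiments.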

Exercises

1. Shown below is a dataset on Hope students. Immediately below is a list of all variables in the survey, followed by a “picture” of the spreadsheet of the data. Questions about this dataset follow the spreadsheet.

Variables:

Gender (0=female, 1=male)
Class (1=freshman, 2=sophomore, 3=junior, 4=senior)
Hometown (name of hometown)
Extracurricular (hours in last 4 weeks)
Religious view change (0=no, 1=yes)
Chapel satisfaction (0=no, 1=yes)
Chapel attendance (times attended in the last month)
West Michigan (Ottawa, Muskegon, Allegan or Kent counties; 0=no, 1=yes)

Collection 1

     gender  class  hometown         extracirricular  religious_views_change  chapel_satisfaction  chapel_attendance  west_MI  <new>
1    0       1      Syracuse, IN     15               0                       1                    12                 0
2    0       1      Zeeland, MI      80               0                       1                    11                 1
3    1       4      Downers Gro...   12               1                       1                    0                  0
4    0       3      Midland, MI      40               0                       1                    5                  0
5    1       4      Grand Rapids...  100              1                       0                    0                  1
6    0       3      North Dorr, MI   8                1                       1                    1                  1
7    0       1      Saline, MI       16               1                       1                    3                  0
8    0       3      Wilmington, IL   10               1                       0                    0                  0
9    0       3      Climax, MI       25               0                       0                    0                  0
10   0       3      Marshall, MI     15               0                       1                    3                  0
11   0       2      Hudsonville, MI  9                0                       1                    8                  1

Collection 1

     gender  class  hometown         extracirricular  religious_views_change  chapel_satisfaction  chapel_attendance  west_MI  <new>
78   0       3      Holland, MI      90               1                       0                    0                  1
79   1       3      Jenison, MI      120              0                       1                    12                 1
80   0       1      Oak Park, IL     5                0                       1                    12                 0
81   1       4      San Fernand...   10               0                       0                    0                  0
82   1       3      Newark, OH       1                0                       0                    0
83   1       2      Kalamazoo, MI    30               1                       1                    11                 0
84   0       1      Mount Prospe...  0                0                       1                    0                  0
85   0       4      Bay City, MI     15               0                       0                    0                  0
86   0       4      Watertown, WI    15               0                       1                    2                  0
87   1       3      Burton, MI       10               1                       1                    0                  0

a) How many variables are in this data set?

b) How many individuals are in this data set?

c) Which variables are categorical?

2. Below is a cross-tabulation table from a survey of 314 Hope students in 2008 that compares their class standing to their “religiosity,” which is their answer to the question “How religious are you?”

         Not at all religious  Not very religious  Somewhat religious  Quite religious  Very religious  Total
Fresh    4                     15                  22                  31               13              85
Soph     3                     8                   26                  24               15              76
Junior   4                     5                   28                  22               14              73
Senior   4                     10                  28                  31               7               80
Total    15                    38                  104                 108              49              314

Based on the table above answer the following questions:

c) What percent of this sample of Hope students say they are not very religious?

d) What percent of this sample of Hope students are juniors?

e) What percent of juniors say that they are not very religious? What percent of seniors say they are not very religious?