
Computer Programs for Finding Sample Sizes

Using data from Dennison and coworkers (1997), we use the SamplePower program to calculate the sample size for a study involving one mean. Output from the program is given in Figure 5-10. (If you use this program, you can automatically get the sample size for 80% power by clicking on the binoculars icon in the tool bar.) SamplePower indicates we need an n of 73, close to the value of 71 we calculated. This program also generates a verbal statement (by pressing the icon that has lines on it, which indicates that it produces a report). Part of a power statement is also reproduced in Figure 5-10.

Figure 5-10. Computer output from the SamplePower program estimating a sample size for the mean juice consumption in 2-year-old children. (Data, used with permission, from Dennison BA, Rockwell HL, Baker SL: Excess fruit juice consumption by preschool-aged children is associated with short stature and obesity. Pediatrics 1997;99:15–22. Table produced with SamplePower 1.00, a registered trademark of SPSS, Inc.; used with permission.)
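
The hand calculation behind the value of 71 can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual normal-approximation formula n = [(z_alpha + z_beta) × SD / difference]², a two-sided alpha of 0.05, 80% power, and the values quoted in this chapter's exercises (standard deviation of 3 oz, difference of 1 oz). The function name is ours and is not part of SamplePower.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_mean(sd, difference, alpha=0.05, power=0.80):
    """Approximate n for a two-sided test of one mean (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # z value corresponding to the desired power
    n = ((z_alpha + z_beta) * sd / difference) ** 2
    return ceil(n)  # round up to a whole subject


print(sample_size_one_mean(sd=3, difference=1))
```

With these inputs the formula gives 71. Programs such as SamplePower typically work from the t distribution rather than the normal, which is a likely reason for the slightly larger n of 73.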

To find the sample size for a proportion, we use the nQuery program with data from Frey and coworkers (2002). Output from this procedure is given in Figure 5-11 and states that 250 patients will provide 81% power. nQuery also generates a statement, included in Figure 5-10, as well as a graph.

Figure 5-11. Computer output from the nQuery program estimating the sample size for a proportion. (Observations based on Frey SE, Couch RB, Tacket CO, Treanor JJ, Wolff M, Newman FK, et al: Clinical responses to undiluted and diluted smallpox vaccine. N Engl J Med 2002;346:1265–1274. Analysis produced with nQuery; used with permission.)

Finally, we illustrate the output from the PASS program for finding the sample size for a mean. The program for one mean can be used for a paired design, and we show this with data from Sauter and colleagues (2002). Output from this procedure is given in Box 5-3 and indicates that a sample size of 20 is needed to conclude that the observed difference in 7α-HCO is significant at P < 0.05.

PASS also provides a graph of the relationship between power and sample size and generates a statement.

BOX 5-3. COMPUTER OUTPUT FROM THE PASS PROGRAM ESTIMATING A SAMPLE SIZE FOR THE NUMBER OF PATIENTS NEEDED IN THE STUDY OF CHOLECYSTECTOMY.

One-Sample t-Test Power Analysis

Numeric results for one-sample t-test

Null hypothesis: Mean0 = Mean1

Alternative hypothesis: Mean0 <> Mean1

Unknown standard deviation

Power     N    Alpha     Beta      Mean0   Mean1   S      Effect Size
0.21133   5    0.05000   0.78867   20.00   0.0     30.0   0.667
0.46923   10   0.05000   0.53077   20.00   0.0     30.0   0.667
0.67086   15   0.05000   0.32914   20.00   0.0     30.0   0.667
0.80729   20   0.05000   0.19271   20.00   0.0     30.0   0.667
0.89202   25   0.05000   0.10798   20.00   0.0     30.0   0.667
0.94158   30   0.05000   0.05842   20.00   0.0     30.0   0.667
0.96929   35   0.05000   0.03071   20.00   0.0     30.0   0.667
0.98424   40   0.05000   0.01576   20.00   0.0     30.0   0.667

Summary Statements

A sample size of 20 achieves 80% power to detect a difference of 20.0 between the null hypothesis mean of 20.0 and the alternative hypothesis mean of 0.0 with an estimated standard deviation of 30.0 and with a significance level (alpha) of 0.05 using a two-sided one-sample t-test.

Figure. Chart Section.

Source: Data, used with permission, from Sauter GH, Moussavian AC, Meyer G, Steitz HO, Parhofer KG, Jungst D: Bowel habits and bile acid malabsorption in the months after cholecystectomy. Am J Gastroenterol 2002;97(2):1732–35. Analyzed with PASS; used with permission.
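
The relationship between power and sample size shown in the PASS output in Box 5-3 can be approximated by hand. The following is a rough sketch only: it assumes a normal (z) approximation in place of the noncentral t distribution that a program like PASS would use, so it is not a reproduction of the PASS algorithm. The function name is ours; the inputs (Mean0 = 20, Mean1 = 0, S = 30, alpha = 0.05) are taken from the box.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_one_mean(mean0, mean1, sd, n, alpha=0.05):
    """Approximate power of a two-sided one-sample test (z approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(mean1 - mean0) / sd * sqrt(n)  # effect size times sqrt(n)
    # Probability of landing in either rejection region under the alternative
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (5, 10, 15, 20, 25, 30, 35, 40):
    print(n, round(approx_power_one_mean(20.0, 0.0, 30.0, n), 3))
```

Because the z approximation ignores the extra uncertainty from estimating the standard deviation, its values run higher than those in the box, most noticeably for small n (at n = 20 it gives about 0.85 rather than 0.81). The overall pattern, power rising steeply and then leveling off as n increases, is the same.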

SUMMARY

This chapter illustrated several methods for estimating and testing hypotheses about means and proportions. We also discussed methods to use in paired or before-and-after designs in which the same subjects are measured twice. These studies are typically called repeated-measures designs.

We used observations on children whose juice consumption and overall energy intake were studied by Dennison and coworkers (1997). We formed a 95% confidence interval for the mean fruit juice consumed by 2-year-old children and found it to be 4.99–6.95 oz/day. We illustrated hypothesis testing for the mean in one group by asking whether the mean energy intake in 2-year-olds was different from the norm found in a national study and showed the equivalence of conclusions when using confidence intervals and hypothesis tests.

In the study published by Frey and colleagues (2002), the investigators found that initial vaccination was successful in 665 of 680 subjects (97.8%); in the group receiving the 1:10 dilution the proportion was 330 of 340 (97.1%). We used data from this study to illustrate statistical methods for a proportion. The authors concluded that vaccinia virus can be diluted to a titer as low as 1:10 and induce local viral replication and vesicle formation in more than 97% of persons; this suggests that the current stocks of smallpox vaccine in the United States could potentially protect nearly 10 times as many people as undiluted vaccine.
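
As a reminder of how a confidence interval for a proportion is formed, the sketch below applies the normal-approximation formula, p ± z × sqrt[p(1 − p)/n], to the 665 successes in 680 subjects quoted above. It is a minimal illustration with a function name of our own choosing; we assume the simple (Wald) interval, and other methods for a proportion give slightly different limits.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for one proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p - z * se, p + z * se


low, high = proportion_ci(665, 680)
print(f"{low:.3f} to {high:.3f}")
```

For these data the interval runs from roughly 96.7% to 98.9%. The approximation is reasonable here because both the number of successes (665) and the number of failures (15) exceed 5.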

To illustrate the usefulness of paired or before-and-after studies, we used data from the study by Sauter and colleagues (2002) in which bile acid absorption and bowel habits were examined before and after cholecystectomy. We analyzed the change in 7α-HCO between baseline and 1 month after surgery and used the t statistic to form a 95% confidence interval for the change. We then performed a paired t test for the change in 7α-HCO and found that the difference was statistically significant. The investigators reported that after cholecystectomy there was an increase in patients reporting more than one bowel movement per day and in those reporting loose stools. Despite significant increases in serum levels of 7α-HCO at 1 and 3 months after surgery, there was no relationship between changes in these levels and changes in bowel habits or occurrence of diarrhea. These results indicate that changes in bowel habits frequently occur after cholecystectomy but that bile acid malabsorption does not appear to be the predominant pathogenic factor in postcholecystectomy diarrhea (PCD).

Yuan and colleagues (2001) showed that MRI can identify lipid-rich necrotic cores and intraplaque hemorrhage in atherosclerotic plaques with high sensitivity and specificity. We used the data to illustrate agreement between two procedures with the κ statistic and found a good level of agreement. The investigators hope that this noninvasive technique will be a useful tool in lipid-lowering clinical trials and in determining prognosis in patients with carotid artery disease.
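
The κ statistic compares the observed agreement with the agreement expected by chance: κ = (observed − expected) / (1 − expected). The short sketch below shows the arithmetic for two raters and a yes/no classification. The counts are hypothetical, chosen only to make the calculation easy to follow; they are not the Yuan data, and the function name is ours.

```python
def cohen_kappa(both_pos, only_first_pos, only_second_pos, both_neg):
    """Cohen's kappa for two raters and a yes/no classification."""
    n = both_pos + only_first_pos + only_second_pos + both_neg
    observed = (both_pos + both_neg) / n
    # Agreement expected by chance, from each rater's marginal totals
    first_pos = both_pos + only_first_pos
    second_pos = both_pos + only_second_pos
    expected = (first_pos * second_pos + (n - first_pos) * (n - second_pos)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts, not the Yuan data
print(round(cohen_kappa(40, 5, 10, 45), 2))
```

In this made-up example the raters agree on 85 of 100 cases, chance alone would produce agreement on 50, and κ is therefore 0.35/0.50 = 0.70.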

On occasion, investigators want to know whether the proportion of subjects changes after an intervention. In this situation, the McNemar test is used, as with changes in stool frequency status in the study of cholecystectomy by Sauter and colleagues (2002).
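
The McNemar test uses only the discordant pairs, that is, the subjects whose status changed in one direction or the other. The sketch below assumes the continuity-corrected form of the statistic, (|b − c| − 1)² / (b + c), referred to a chi-square distribution with 1 degree of freedom. The counts are hypothetical and are not the Sauter data; the function name is ours.

```python
from math import sqrt
from statistics import NormalDist


def mcnemar(changed_to_yes, changed_to_no):
    """Continuity-corrected McNemar chi-square (1 df) and its P value."""
    b, c = changed_to_yes, changed_to_no
    chi_sq = (abs(b - c) - 1) ** 2 / (b + c)
    # A chi-square with 1 df is the square of a standard normal deviate
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi_sq)))
    return chi_sq, p_value


# Hypothetical counts of discordant pairs, not the Sauter data
chi_sq, p = mcnemar(15, 5)
print(round(chi_sq, 2), round(p, 3))
```

With 15 subjects changing in one direction and 5 in the other, the statistic is 81/20 = 4.05, which exceeds the critical value of 3.84 for alpha = 0.05. The sketch does not guard against the case of no discordant pairs, in which the statistic is undefined.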

We explained alternative methods to use when observations are not normally distributed. Among these are several kinds of transformations, with the log (logarithmic) transformation being fairly common, and nonparametric tests. Nonparametric tests make no assumptions about the distribution of the data. We illustrated the sign test for testing hypotheses about the median in one group and the Wilcoxon signed rank test for paired observations, which has power almost as great as that of the t test.
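
The logic of the sign test is simple enough to sketch directly: if the hypothesized median is correct, each observation is equally likely to fall above or below it, so the number falling below follows a binomial distribution with a proportion of 0.5. The sketch below assumes the exact binomial form of the test (a normal approximation is often used for larger samples). The data are hypothetical and the function name is ours.

```python
from math import comb


def sign_test(values, hypothesized_median):
    """Two-sided sign test P value from the binomial distribution."""
    above = sum(v > hypothesized_median for v in values)
    below = sum(v < hypothesized_median for v in values)
    n = above + below  # observations equal to the median are dropped
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical skewed observations, not data from the chapter
data = [1, 2, 6, 7, 8, 9, 11, 14, 20, 35]
print(sign_test(data, hypothesized_median=5))
```

Here 8 of the 10 observations lie above the hypothesized median of 5 and 2 lie below, giving a two-sided P value of about 0.11, so the null hypothesis would not be rejected at the 0.05 level.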

We concluded the chapter with a discussion of the important concept of power. We outlined the procedures for estimating the sample size for research questions involving one group and illustrated the use of three statistical programs that make the process much easier.

In the next chapter, we move on to research questions that involve two independent groups. The methods you learned in this chapter are not only important for their use with one group of subjects, but they also serve as the basis for the methods in the next chapter.

A summary of the statistical methods discussed in this chapter is given in Appendix C. These flowcharts can help both readers and researchers determine which statistical procedure is appropriate for comparing means.

EXERCISES

1. Using the study by Dennison and coworkers (1997), find the 99% confidence interval for the mean fruit juice consumption among 2-year-olds and compare the result with the 95% confidence interval we found (4.99–6.95).

a. Is it wider or narrower than the confidence interval corresponding to 95% confidence?

b. How could Dennison and coworkers increase the precision with which the mean level of juice consumption is measured?

c. Recalculate the 99% confidence interval assuming the number of children is 200. Is it wider or narrower than the confidence interval corresponding to 95% confidence?

2. Using the study by Dennison and coworkers, test whether the mean consumption of soda in 2-year-olds differs from zero.

What is the P value? Find the 95% confidence interval for the mean and compare the results to the hypothesis test.

3. Using the Dennison and coworkers study, determine the sample size needed if the researchers wanted 80% power to detect a difference of ≥ 2 oz in fruit juice consumption among 2-year-olds (assuming the standard deviation is 3 oz). Compare the results with the sample size needed for a difference of 1 oz.

4. What sample size is needed if Frey and coworkers (2002) wanted to know if an observed 97% of patients with an initial success to vaccination is different from an assumed norm of 90%? How does this number compare with the number we found assuming a rate of 95%?

5. Our calculations indicated that a sample size of 71 is needed to detect a difference of ≥1 oz from an assumed mean of 5 oz in the Dennison and coworkers study, assuming a standard deviation of 3 oz. Dennison and coworkers had 94 children in their study and found a mean juice consumption of 5.97 oz. Because 94 is larger than 71, we expect that a 95% CI for the mean would not contain 5. The CI we found was 4.99–6.95, however, and because this CI contains 5, we cannot reject a null hypothesis that the true mean is 5. What is the most likely explanation for this seeming contradiction?

6. Using the data from the Sauter and colleagues study (2002), how large would the mean difference need to be to be considered significant at the 0.05 level if only ten patients were in the study? Hint: Use the formula for one mean and solve for the difference, using 30 as an estimate of the standard deviation.

7. Two physicians evaluated a sample of 50 mammograms and classified them as negative (needing no follow-up) versus positive (needing follow-up). Physician 1 determined that 30 mammograms were negative and 20 were positive, and physician 2 found 35 negative and 15 positive. They agreed that 25 were negative. What is the agreement beyond chance?

8. Use the data from the Sauter and colleagues study to determine if a change occurs in HDL after 3 months (HDL3DIFF).

a. First, examine the distribution of the changes in HDL. Is the distribution normal so we can use the paired t test, or is the Wilcoxon test more appropriate?

b. Second, use the paired t test to compare the before-and-after measures of HDL; then, use the t test for one sample to compare the difference to zero. Compare the answers from the two procedures.

9. Using the Canberra Interview for the Elderly (CIE), Henderson and colleagues (1997) collected data on depressive symptoms and cognitive performance for 545 people. The interview was given at baseline and again 3–4 years later. The CIE reports the depression measure on a scale from 1 to 17.

a. Use the data set in the folder entitled “Henderson” on the CD-ROM to examine the distribution of the depression scores at baseline and later. What statistical method is preferred for determining if a change occurs in depression scores?

b. We recoded the depression score as depressed versus not depressed. Use the McNemar statistic to see if the proportion of depressed people is different at the end of the study.

c. Do the conclusions agree? Discuss why or why not.

10. Dennison and coworkers also studied 5-year-old children. Use the data set in the CD-ROM folder marked “Dennison” to evaluate fruit juice consumption in 5-year-olds.

a. Are the observations normally distributed?

b. Perform the t test and sign test for one group. Do these two tests lead to the same conclusion? If not, which is the more appropriate?

c. Produce a box plot for 2-year-olds and for 5-year-olds and compare them visually. What do you think we will learn when we compare these two groups in the next chapter?

11. If you have access to the statistical program Visual Statistics, use the Discrete Distributions module to see how the distribution changes as the proportion and the sample size change. What happens as the proportion gets closer to 0? to 0.5? to 1? And what happens as the sample size increases? Decreases? Try some situations in which the proportion times the sample size is quite small (eg, 0.2 × 10). What happens to the shape of the distribution then?

12. Group Exercise. Congenital or developmental dysplasia of the hip (DDH) is a common pediatric affliction that may precede arthritic deformities of the hip in adult patients. Among patients undergoing total hip arthroplasty, the prevalence of DDH is 3–10%. Ömeroğlu and colleagues (2002) analyzed a previously devised radiographic classification system for the shape of the acetabular roof. The study design required that four orthopedic surgeons independently evaluate the radiographs of 33 patients who had previously been operated on to treat unilateral or bilateral DDH. They recorded their measurements independently on two separate occasions during a period of 1 month. You may find it helpful to obtain a copy of the article to help answer the following questions.

a. What was the study design? Was it appropriate for the research question?

b. How did the investigators analyze the agreement among the orthopedic surgeons? What is this type of agreement called?

c. How did the investigators analyze the agreement between the measurements made on two separate occasions by a given orthopedic surgeon?

d. Based on the guidelines presented in this chapter, how valuable is the classification system?

13. Following is a report that appeared in the April–June 1999 Chance News from the Chance Web site at http://www.dartmouth.edu/~chance/chance_news/recent_news/chance_news_8.05.html#polls

Read the information and answer the discussion questions. “Election Had Too Many Polls and Not Enough Context.”

We do not often see a newspaper article criticizing the way it reports the news but this is such an article. Schachter writes about the way newspapers confuse the public with their tracking of the polls. He starts by commenting that the polls are “crude instruments which are only modestly accurate.” The truth is in the margin of error, which is “ritualistically repeated in the boilerplate paragraph that newspapers plunk about midway through poll stories (and the electronic media often ignore).”

He remarks that when the weather forecaster reports a 60% chance of rain tomorrow, few people believe the probability of rain is exactly 60%. But when a pollster says that 45% of the voters will vote for Joe Smith, people believe this and feel that the poll failed if Joe got only 42%. They also feel that something is wrong when the polls do not agree.

Schachter reviews how the polls did in the recent Ottawa election and finds that they did quite a good job taking the margin of error into account, “much better than the people reporting, actually.”

In a more detailed analysis of the polls in this election, Schachter gives examples to show that, when newspapers try to explain each chance fluctuation in the polls, they often miss the real reason voters change their minds.

Schachter concludes by saying:

It's amusing to consider what might happen if during an election one media outlet reported all the poll results as a range. Instead of showing the Progressive Conservatives at 46%, for example, the result would be shown as 44–50%. That imprecision would silence many of the pollsters who like to pretend they understand public opinion down to a decimal point. And after the initial confusion, it might help the public to see polls for what they are: useful, but crude, bits of information.

--(Harvey Schachter, The Ottawa Citizen, 5 June, 1999)

DISCUSSION QUESTIONS

1. What do you think about the idea of giving polls as intervals rather than as specific percentages? Would this help also in weather predictions?

2. Do you agree that weather predictions of the temperature are understood better than poll estimates? For example, what confidence interval would you put on a weather predictor's 60% chance for rain?

Footnotes

a. Where does the value of 26.4 come from? Recall from Chapter 4 that the standard error of the mean, SE, is the standard deviation of the mean, not the standard deviation of the original observations. We calculate the standard error of the mean by dividing the standard deviation by the square root of the sample size: SE = SD/√n.

b. Remember from your high school or college math courses that the log of a number is the power of 10 that gives the number. For example, the log of 100 is 2 because 10 raised to the second power (10²) is 100, the log of 1000 is 3 because 10 is raised to the third power (10³) to obtain 1000, and so on. We can also think about logs as being exponents of 10 that must be used to get the number.

Editors: Dawson, Beth; Trapp, Robert G.

Title: Basic & Clinical Biostatistics, 4th Edition

Copyright ©2004 McGraw-Hill
