
Finding the Sample Size for Studies About Proportions in Two Groups

This section presents the formula for estimating the approximate sample size needed in a study with two groups when the outcome is expressed in terms of proportions. Just as with studies involving two means, the researcher must answer four questions.

1. What is the desired level of significance (the α level) related to the null hypothesis?

2. What should be the chance of detecting an actual difference; that is, what is the desired power (1 – β) associated with the alternative hypothesis?

3. How large should the difference be between the two proportions for it to be clinically significant?

4. What is a good estimate of the standard deviation in the population? For a proportion, it is easy: The null hypothesis assumes the proportions are equal, and the proportion itself determines the estimated standard deviation: √[π(1 – π)].

To simplify matters, we again assume that the sample sizes are the same in the two groups. The symbol π1 denotes the proportion in one group, and π2 the proportion in the other group. Then, the formula for n is

where zα is the two-tailed z value related to the null hypothesis and zβ is the lower one-tailed z value related to the alternative hypothesis.
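As a sketch only, one widely used normal-approximation form of this calculation is written out below, using the same sign convention (zβ is the lower-tail, negative value). The pooled proportion π̄ = (π1 + π2)/2 under the null hypothesis is an assumption of this sketch; other versions of the formula use a single group's proportion in the first term and give somewhat different answers.

```latex
n = \left[ \frac{ z_{\alpha}\sqrt{2\,\bar{\pi}(1-\bar{\pi})} \;-\; z_{\beta}\sqrt{\pi_1(1-\pi_1) + \pi_2(1-\pi_2)} }{ \pi_1 - \pi_2 } \right]^{2},
\qquad \bar{\pi} = \frac{\pi_1 + \pi_2}{2}
```

Here n is the number of subjects needed in each group.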

To illustrate, we use the study by Lapidus and colleagues (2002) of screening for domestic violence (DV). Among physicians with training in DV, 175 of 202 reported they screen routinely or selectively (0.866), compared with 155 of 266 physicians without DV training (0.583). We found that the 95% confidence interval for the difference in proportions was 0.201 to 0.357, and because the interval does not contain 0, we concluded a difference existed in the proportion who screen for DV. Suppose that the investigators, prior to doing the study, wanted to estimate the sample size needed to detect a significant difference if the proportions who screened were 0.85 and 0.55. They were willing to accept a type I error (falsely concluding that a difference exists when none really occurred) of 0.05, and they wanted a 0.90 probability of detecting a true difference (ie, 90% power).

BOX 6-1. TWO-SAMPLE T TEST POWER ANALYSIS FOR PULSE OXIMETRY.


Numeric results for two-sample t test

Null hypothesis: Mean1 = Mean2
Alternative hypothesis: Mean1 <> Mean2
The standard deviations were assumed to be known and equal.

Power     N1    N2    Allocation Ratio   Alpha     Beta      Mean1   Mean2   Sigma1   Sigma2
0.55894   50    200   4.000              0.05000   0.44106   95.0    93.0    6.0      6.0
0.63663   60    240   4.000              0.05000   0.36337   95.0    93.0    6.0      6.0
0.70350   70    280   4.000              0.05000   0.29650   95.0    93.0    6.0      6.0
0.76013   80    320   4.000              0.05000   0.23987   95.0    93.0    6.0      6.0
0.80743   90    360   4.000              0.05000   0.19257   95.0    93.0    6.0      6.0
0.84648   100   400   4.000              0.05000   0.15352   95.0    93.0    6.0      6.0
0.87839   110   440   4.000              0.05000   0.12161   95.0    93.0    6.0      6.0
0.90423   120   480   4.000              0.05000   0.09577   95.0    93.0    6.0      6.0
0.92498   130   520   4.000              0.05000   0.07502   95.0    93.0    6.0      6.0
0.94152   140   560   4.000              0.05000   0.05848   95.0    93.0    6.0      6.0
0.95463   150   600   4.000              0.05000   0.04537   95.0    93.0    6.0      6.0
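The power values in Box 6-1 can be approximated with a short calculation. The sketch below assumes a two-tailed z test with a known, common standard deviation, which is how the box describes its own assumptions; the function name `two_sample_power` is hypothetical and is not part of NCSS or any other package. Under these assumptions the results should agree closely with the Power column.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(n1, n2, mean1, mean2, sigma, alpha=0.05):
    """Approximate power of a two-tailed, two-sample z test.

    Assumes both groups share a known standard deviation `sigma`.
    """
    z = NormalDist()
    # Standard error of the difference between the two sample means
    se = sigma * sqrt(1.0 / n1 + 1.0 / n2)
    # True difference expressed in standard-error units
    shift = abs(mean1 - mean2) / se
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Probability of rejecting the null hypothesis in either tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Same design as Box 6-1: N2 is four times N1, means 95 and 93, sigma 6
for n1 in range(50, 160, 10):
    n2 = 4 * n1
    power = two_sample_power(n1, n2, 95.0, 93.0, 6.0)
    print(f"{power:.5f}  {n1:4d}  {n2:4d}")
```

Reading down the output shows how power rises with sample size when the allocation ratio stays fixed at 4 to 1.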

The two-tailed z value related to α is +1.96, and the lower one-tailed z value related to β is –1.28, the value that separates the lower 10% of the z distribution from the upper 90%. Then, the estimated sample size is

We use the nQuery program with data from Lapidus and colleagues to illustrate finding the sample size for the difference in two proportions. The table and plot produced by nQuery are given in Figure 6-7 and indicate that n needs to be slightly larger than our estimate.
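The same kind of estimate can be sketched in a few lines of Python. This is a minimal illustration, not the nQuery method: it assumes the pooled-proportion normal approximation, with the two proportions averaged under the null hypothesis, and the function name `n_per_group` is hypothetical. Other variants of the formula, or a continuity correction, give somewhat different answers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.90):
    """Approximate sample size per group for comparing two proportions.

    Assumes equal group sizes and a two-tailed test; uses the pooled
    proportion under the null hypothesis (an assumption of this sketch).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-tailed value, about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(1.0 - power)         # lower one-tailed value, negative
    p_bar = (p1 + p2) / 2.0
    null_term = z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
    alt_term = z_beta * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    # z_beta is negative, so subtracting it adds to the numerator
    return ((null_term - alt_term) / (p1 - p2)) ** 2


# Proportions from the domestic violence screening example
n_raw = n_per_group(0.85, 0.55, alpha=0.05, power=0.90)
n_needed = ceil(n_raw)  # round up to a whole subject
print(f"n per group: {n_raw:.2f}, rounded up to {n_needed}")
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target under the stated approximation.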


Source: Data, used with permission of the authors and publisher, Kline JA, Nelson RD, Jackson RE, Courtney DM: Criteria for the safe use of D-dimer testing in emergency department patients with suspected pulmonary embolism: A multicenter US study. Ann Emerg Med 2002;39:144–152. Analysis produced with NCSS; used with permission.

SUMMARY

This chapter has focused on statistical methods that are useful in determining whether two independent groups differ on an outcome measure. In the next chapter, we extend the discussion to studies that involve more than two groups.

The t test is used when the outcome is measured on a numerical scale. If the distribution of the observations is skewed or if the standard deviations in the two groups are different, the Wilcoxon rank sum test is the procedure of choice. In fact, it is such a good substitute for the t test that some statisticians recommend it for almost all situations.

The chi-square test is used with counts or frequencies when two groups are being analyzed. We discussed what to do when sample sizes are small, commonly referred to as small expected frequencies. We recommend Fisher's exact test with a 2 × 2 table. We briefly touched on some other issues related to the use of chi-square in medical studies.

In Presenting Problem 1, Kline and his colleagues (2002) wanted to know if patients who experienced a pulmonary embolism (PE) differed from those who did not, and they looked at several outcomes. The researchers found a difference in heart rate, systolic blood pressure, pH, and pulse oximetry. Patients who had a PE had higher heart rates and pH, but lower systolic blood pressure and pulse oximetry. A goal of their study was to find a decision rule that would divide patients with suspected PE into a high-risk group in which the D-dimer test should not be used and a low-risk group in which the test is appropriate. We revisit their study in Chapter 12.

We used the study by Harper (1997) to illustrate the t test for two independent groups. Harper wanted to know whether women undergoing cryosurgery who had a paracervical block before the surgery experienced less pain and cramping than women who did not have the block. We compared the scores that women assigned to the degree of cramping they experienced with the procedure.

Women who had the paracervical block had significantly lower scores, indicating they experienced less severe cramping. We used the same data to illustrate the Wilcoxon rank sum test and came to the same conclusion. The Wilcoxon test is recommended when assumptions for the t test (normal distribution, equal variances) are not met. The investigator reported that women receiving paracervical block perceived less cramping than those who did not receive it, a result that is consistent with our analysis. The paracervical block did not decrease the perception of pain, however.

Turning to research questions involving nominal or categorical outcomes, we introduced the z statistic for comparing two proportions and the chi-square test. In Lapidus and colleagues' (2002) study, investigators were interested in learning whether training in domestic violence (DV) and subsequent screening of patients for DV were related. We used the same data to illustrate the construction of confidence intervals and the z test for two proportions and came to the same conclusion, illustrating once more the equivalence between the conclusions reached using confidence intervals and statistical tests.

Figure 6-7. Two-sample test for proportions power analysis using nQuery Advisor; used with permission.

The chi-square test uses observed frequencies and compares them to the frequencies that would be expected if no differences existed in proportions. We again used the data from Lapidus and colleagues (2002) to illustrate the chi-square test for two groups, that is, for observations that can be displayed in a 2 × 2 table. Once more, the results of the statistical test indicated that a difference existed in proportions of physicians who screened for DV, depending on whether they had been trained to do so.
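The comparison of observed and expected frequencies can be sketched as follows. This is a minimal illustration without a continuity correction; the helper name `chi_square` is hypothetical. The counts of physicians who did not screen (27 and 111) are derived here by subtracting the reported counts from the group totals.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a table of counts.

    `observed` is a list of rows; no continuity correction is applied.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Frequency expected if row and column were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Rows: DV training, no DV training; columns: screen, do not screen
table = [
    [175, 202 - 175],
    [155, 266 - 155],
]
statistic, df = chi_square(table)
# 3.841 is the critical chi-square value for 1 degree of freedom at alpha = 0.05
print(f"chi-square = {statistic:.2f}, df = {df}, exceeds 3.841: {statistic > 3.841}")
```

A statistic well above the critical value points to the same conclusion as the confidence interval and the z test.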

The importance of sample size calculations was again stressed. We illustrated formulas and computer programs that estimate the sample sizes needed when two independent groups of subjects are being compared.

A summary of the statistical methods discussed in this chapter is given in Appendix C.

EXERCISES

1. How does a decrease in sample size affect the confidence interval? Recalculate the confidence interval for pulse oximetry in the section titled, “Decisions About Means in Two Independent Groups,” assuming that the means and standard deviations were the same but only 25 patients were in each group. Recalculate the pooled standard deviation and standard error and repeat the confidence interval. Is the conclusion the same?


2. Calculate the pooled standard deviation for the total cramping score from Table 6-3.

3. Good and colleagues (1996) used the Barthel index (BI) to measure mobility and activities of daily living in a group of patients who had sleep apnea. This breathing disorder is characterized by periodic reductions in the depth of breathing (hypopnea), periodic cessation of breathing (apnea), or a continuous reduction in ventilation. The Barthel index, a standardized scale that measures mobility and activities of daily living, was recorded at admission, at discharge, and at 3 and 12 months after stroke onset. Data files are on the CD-ROM in a folder called “Good.”

a. Did patients with a desaturation index (DI) < 10 have the same mean BI at discharge as patients with a DI ≥ 10? Answer this question using a 95% confidence interval.

b. Did a significant increase occur in BI from the time of admission until discharge for all the patients in the study (ie, ignoring the desaturation index)? Answer this question using a 95% confidence interval.

4. Show that the pooled variance for two means is the average of the two variances when the sample sizes are equal.

5. Use the data from Kline and colleagues (2002) to compare pulse oximetry in patients who did and those who did not have a PE. Compare the conclusion with the confidence interval in the section titled, “Decisions About Means in Two Independent Groups.”

6. Use the rules for finding the probability of independent events to show why the expected frequency in the chi-square statistic is found by the following formula: expected frequency = (row total × column total) / grand total.

7. How was the rule of thumb for calculating the sample size for two independent groups found?

8. Refer to the study by Good and colleagues (1996) on patients with stroke. How large a sample is needed to detect a difference of 0.85 versus 0.55 in the proportions discharged home with 80% power?

9. Compute the 90% and 99% confidence intervals for the difference in pulse oximetry for the patients with and without PE (Kline, 2002). Compare these intervals with the 95% interval obtained in the section titled, “Comparing Two Means Using Confidence Intervals.” What is the effect of lower confidence on the width of the interval? Of higher confidence?

10. Suppose investigators compared the number of cardiac procedures performed by 60 cardiologists in large health centers during one year to the number of procedures done by 25 cardiologists in midsized health centers. They found no significant difference between the number of procedures performed by the average cardiologist in large centers and those performed in midsized centers using the t test. When they reanalyzed the data using the Wilcoxon rank sum test, however, the investigators noted a difference. What is the most likely explanation for the different findings?

11. Benson and colleagues (1996) designed a randomized clinical trial to learn whether a vaginal or an abdominal approach is more effective in surgically treating severe uterovaginal prolapse. Over a 2-year period, women were assigned on the basis of a random number table to have pelvic reconstruction surgery by either a vaginal or an abdominal approach. Surgical outcomes were noted as optimally effective, satisfactorily effective, or unsatisfactorily effective based on an assessment of prolapse symptoms and integrity of the vaginal support during a Valsalva strain maneuver. The patients were examined postoperatively at 6 months and then annually for up to 5 years. Other outcome measures included charges for hospital stay, length of stay, and time required in the operating room. Data from this study are given in Table 6-10 and on the CD-ROM.

Perform an appropriate statistical procedure to answer the following questions:

a. Do the groups show a difference in the operating room time?

b. Are the variances of operating room times similar in both groups?

Table 6-10. Means and standard deviations on variables from the study on reconstructive surgery for pelvic defects.


Source: Reproduced, with permission, from Benson JT, Lucente V, McClellan E: Vaginal versus abdominal reconstructive surgery for the treatment of pelvic support defects: A prospective randomized study with long-term outcome evaluation. Am J Obstet Gynecol 1996;175:1418–1422.

12. When testing the variances of pH for those who had a PE and those who did not (Kline, 2002), the F test indicated the variances were unequal, but the Levene test indicated they were not. What is the most likely explanation for this seeming contradiction? Use the data on the CD-ROM to form histograms or box plots for the two groups. What do you notice?

13. Recall that in Chapter 5 exercises we examined box plots for daily juice consumption by 2- and 5-year-olds (Dennison et al, 1997). We asked you to say whether you thought the two groups drank different amounts of juice. Now, use the t test to learn if the means are different.

14. Group Exercise. Many older patients use numerous medications, and, as patients age, the chance of medication errors increases. Gurwitz and colleagues (2003) undertook a study of all Medicare patients seen by a group of physicians (multispecialty) during 1 year. The primary outcomes were number of adverse drug events, seriousness of the events, and whether they could have been prevented. Obtain a copy of the article to answer the following questions.

a. What was the study design? Why was this design particularly appropriate?

b. What methods did the investigators use to learn the outcomes? Were they sufficient?

c. What statistical approach was used to evaluate the outcome rates?

d. What statistical methods were used to analyze the characteristics in Table 1 in the Gurwitz study?

e. What was the major conclusion from the study? Was this conclusion justified? Would additional information help readers decide whether the conclusion was appropriate?

15. Group Exercise. Physicians and dentists may be at risk for exposure to blood-borne diseases during invasive surgical procedures. In a study that is still relevant, Serrano and coworkers (1991) wanted to determine the incidence of glove perforation during obstetric procedures and identify risk factors. The latex gloves of all members of the surgical teams performing cesarean deliveries, postpartum tubal ligations, and vaginal deliveries were collected for study; 100 unused gloves served as controls. Each glove was tested by inflating it with a pressurized air hose to 1.5–2 times the normal volume and submerging it in water. Perforations were detected by the presence of air bubbles when gentle pressure was applied to the palmar surface. Among the 754 study gloves, 100 had holes; none of the 100 unused control gloves had holes. In analyzing the data, the investigators found that 19 of the gloves with holes were among the 64 gloves worn by scrub technicians. Obtain a copy of this paper from your medical library and use it to help answer the following questions:

a. What is your explanation for the high perforation rate in gloves worn by scrub technicians? What should be done about these gloves in the analysis?

b. Are there other possible sources of bias in the way this study was designed?

c. An analysis reported by the investigators was based on 462 gloves used by house staff. The levels of training, number of gloves used, and number of gloves with holes were as follows: Interns used 262 gloves, 30 with holes; year 2 residents used 71 gloves, 9 with holes; year 3 residents used 58 gloves, 4 with holes; and year 4 residents used 71 gloves, 17 with holes. Confirm that a relationship exists between training level and proportion of perforation, and explain the differences in proportions of perforations.

d. What conclusions do you draw from this study? Do your conclusions agree with those of the investigators?


Footnotes

a. To be precise, the confidence interval is interpreted as follows: 95% of such confidence intervals contain the true difference between the two means if repeated random samples of operating room times are selected and 95% confidence intervals are calculated for each sample.

b. As an aside, the different names for this statistic occurred when a statistician, Wilcoxon, developed the test at about the same time as a pair of statisticians, Mann and Whitney. Unfortunately for readers of the medical literature, there is still no agreement on which name to use for this test.

c. The z test for the difference between two independent proportions is actually an approximate test. That is why we must assume the proportion times the sample size is > 5 in each group. If not, we must use the binomial distribution or, if np is really small, we might use the Poisson distribution (both introduced in Chapter 4).

d. We say small amounts because of what is called sampling variability: variation among different samples of patients who could be randomly selected for the study.

Editors: Dawson, Beth; Trapp, Robert G.

Title: Basic & Clinical Biostatistics, 4th Edition Copyright ©2004 McGraw-Hill


7