Sample size calculations for controlling the distribution of false discovery proportion in microarray experiments


doi:10.1093/biostatistics/kxp024

Advance Access publication on July 23, 2009


TOMONORI OURA∗

Department of Biostatistics, Kyoto University School of Public Health, Yoshidakonoe-cho, Sakyo-ku, Kyoto 606-8501, Japan

[email protected]

SHIGEYUKI MATSUI

Department of Data Science, The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan

KOJI KAWAKAMI

Department of Pharmacoepidemiology, Kyoto University School of Public Health, Yoshidakonoe-cho, Sakyo-ku, Kyoto 606-8501, Japan

SUMMARY

The false discovery proportion (FDP), the proportion of false rejections among all rejections, provides useful criteria for controlling false positives in multiple testing to detect differential genes in microarray experiments. Owing to a substantial variability in FDP for correlated genes, some authors considered controlling actual FDP, instead of its expectation, that is false discovery rate, in multiple testing. However, there has been no attempt to do this in the design of microarray experiments. In this article, we develop a procedure for sample size calculation to control the distributions of FDP and true positives simultaneously under blockwise correlation structures among genes. The sizes of gene blocks, correlation coefficients, and effect sizes within gene blocks can vary across gene blocks. Gene clustering is proposed to identify gene blocks using historical data sets. The adequacy of the procedure is demonstrated using simulated data sets. An application to a clinical study for lymphoma is also provided.

Keywords: False discovery proportion; Gene expression; Microarray; Sample size.

1. INTRODUCTION

DNA microarrays have been widely used for screening for differentially expressed (DE) genes among different phenotypes such as clinical subtypes or prognostic classes of disease from a large pool of candidate genes. For detecting differential genes in microarray experiments, appropriate adjustment for multiple testing on a large number of genes is critical. False discovery proportion (FDP), defined as the proportion of false rejections among all rejections, provides useful criteria for controlling false positives in exploratory microarray experiments. Many multiple testing procedures have been developed for controlling the expectation of FDP, that is false discovery rate (FDR) (e.g. Benjamini and Hochberg, 1995; Storey, 2002). However, owing to a substantial discrepancy between FDR and actual FDP when the correlation between genes increases, the control of FDR can provide a false sense of security (Korn and others, 2004; Pawitan and others, 2006). Some authors therefore consider controlling the actual FDP, taking into account the variability of FDP in multiple testing (Korn and others, 2004, 2007; Genovese and Wasserman, 2004, 2006). Specifically, in these procedures, the probability that FDP is less than a specified level, for example 10%, is controlled.

∗To whom correspondence should be addressed.

The determination of sample sizes for microarray experiments is an important issue, particularly, for controlling statistical power in detecting truly differential genes or obtaining true positives. Recently, many authors have considered this issue for controlling FDR (e.g. Tsai and others, 2005; Pawitan and others, 2005; Jung, 2005; Shao and Tseng, 2007). However, if a multiple testing procedure with a safer criterion that controls actual FDP, rather than FDR (such as that proposed by Korn and others, 2004), is employed for a microarray data set whose sample size is determined by a sample size calculation for controlling FDR, we may fail to keep the statistical power for detecting true positives (as demonstrated in Section 3). The control of actual FDP is generally more stringent than the control of FDR. Therefore, when one needs to control actual FDP, a sample size calculation for this requirement is needed.

In this article, we develop a procedure for sample size calculation to control actual FDP in microarray experiments. Because the variability in the proportion of true positives or sensitivity can be substantial for correlated genes (Shao and Tseng, 2007), we also control the probability that the sensitivity is greater than a specified level. Shao and Tseng (2007) call this probability “overall power.” As such, our procedure involves the combination of 2 separate threads of research: (1) use of FDP, rather than FDR, to control more stringently the detection of differential genes, as first proposed by Korn and others (2004), and (2) sample size calculations to control overall power when genes are correlated, as proposed by Shao and Tseng (2007). We assume that an assemblage of genes (i.e. gene block) has a common correlation coefficient and a common effect size on phenotypic classes, but the sizes of the gene blocks, the correlation coefficients, and the effect sizes within each gene block can vary across gene blocks. We also propose a simple procedure based on gene clustering to identify gene blocks and estimate design parameters using historical data sets. Lastly, the application to the microarray data set from a lymphoma clinical study is provided.

2. METHODS

2.1 The framework of multiple testing

Suppose that out of $m$ multiple tests for detecting DE genes, the null hypotheses ($H_0$) are true for $m_0$ tests and the alternative hypotheses ($H_1$) are true for $m_1$ ($= m - m_0$) tests. For the comparisonwise type I error rate $\alpha$ for each test, the multiple tests declare that, of the $m_0$ null hypotheses, $R_0$ hypotheses are rejected (false positives) and, of the $m_1$ alternative hypotheses, $R_1$ hypotheses are rejected (true positives). The results of the $m$ tests are summarized in Table 1. To control false positives, we consider the FDP, defined as:

$$\mathrm{FDP} = \frac{R_0}{R_0 + R_1}. \tag{2.1}$$

To control true positives, which can be ensured by sample size calculations, we consider the following proportion, called “sensitivity”:

$$\mathrm{Se} = \frac{R_1}{m_1}. \tag{2.2}$$


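Table 1. Outcomes of multiple testing

| True hypothesis | Reject $H_0$ | Accept $H_0$ | Total |
|---|---|---|---|
| $H_0$ is true | $R_0$ | $m_0 - R_0$ | $m_0$ |
| $H_1$ is true | $R_1$ | $m_1 - R_1$ | $m_1$ |
| Total | $R$ | $m - R$ | $m$ |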

Note that FDP and Se are random variables because they are functions of the random variables $R_0$ and $R_1$. The expectation of FDP is the FDR. Specifically, $\mathrm{FDR} = E(\mathrm{FDP})$ for $R = R_0 + R_1 > 0$ and $\mathrm{FDR} = 0$ for $R = 0$. As in Jung (2005), we focus on the case $\Pr(R > 0) \approx 1$. The expectation of Se is called “average power” (Jung, 2005; Shao and Tseng, 2007). In our sample size calculation, we control actual FDP and Se, taking into account the variability of FDP and Se.
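As a small illustration of definitions (2.1) and (2.2), the following Python sketch computes FDP and Se from the counts in Table 1. The function name and the example counts are hypothetical; the convention FDP = 0 when no hypothesis is rejected follows the definition of FDR given above.

```python
def fdp_and_sensitivity(r0, r1, m1):
    """Return (FDP, Se) for r0 false positives and r1 true positives
    when m1 genes are truly differentially expressed."""
    if m1 <= 0 or r0 < 0 or not 0 <= r1 <= m1:
        raise ValueError("need m1 > 0, r0 >= 0 and 0 <= r1 <= m1")
    total_rejections = r0 + r1
    # FDP is taken to be 0 when nothing is rejected (R = 0).
    fdp = r0 / total_rejections if total_rejections > 0 else 0.0
    se = r1 / m1
    return fdp, se


# Hypothetical counts: 4 false positives, 40 true positives, 50 DE genes.
fdp, se = fdp_and_sensitivity(r0=4, r1=40, m1=50)
print(round(fdp, 4), se)  # 0.0909 0.8
```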

2.2 The procedure for sample size calculation

In what follows, we use the term “class” to represent a phenotypic class for which the relation with gene expressions will be investigated using multiple testing. The terms “block” and “gene block” are reserved for an assemblage of genes with a common correlation coefficient and a common effect size on classes.

The gene expression data considered here are normalized log ratios from 2-color complementary DNA (cDNA) arrays or normalized log signals from oligonucleotide arrays (e.g. Affymetrix GeneChip). For simplicity, we consider a comparison of gene expression of 2 classes on the phenotype using a standard two-sample t statistic. Extensions to other statistics and more general comparison problems would be possible with minor or obvious modifications. We consider one-sided t-tests for selecting overexpressed genes for a particular phenotypic class. (See Section 5 for discussion of using two-sided tests.) The model underlying the t statistic assumes normally distributed expression levels with a common standard deviation in the 2 classes.

For a DE gene, we consider the absolute standardized mean difference, that is the absolute mean difference between 2 classes divided by the common standard deviation, as effect size. Let $n$ be the total number of samples for microarray experiments, and let $a_1$ and $a_2$ be the allocation proportions for 2 phenotypic classes, so that $n_1 = a_1 n$, $n_2 = a_2 n$, and $a_1 + a_2 = 1$. In sample size calculations, we regard $a_1$ and $a_2$ as given or fixed quantities.

For a single DE gene with an effect size $\delta$, the comparisonwise power $1 - \beta$ is a function of $n$, $\alpha$, and $\delta$ for given $a_1$ and $a_2$, in the context of a single two-sample t-test, which is expressed as

$$1 - \beta = 1 - T_{n-2}\left(t_{n-2,\alpha} \mid \delta\sqrt{n a_1 a_2}\right), \tag{2.3}$$

where $T_f(\cdot \mid a)$ is the cumulative distribution function of a noncentral t-distribution with $f$ degrees of freedom and noncentrality parameter $a$, and $t_{f,q}$ is the upper $q$ point of the central t-distribution with $f$ degrees of freedom (e.g. Chow and others, 2003, p 57).
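To give a feel for how the comparisonwise power in (2.3) depends on the sample size, the significance level, and the effect size, the sketch below uses a large-sample normal approximation, replacing the noncentral t-distribution by a shifted standard normal. This is a simplification chosen so that the example needs only the Python standard library; it is not the exact expression (2.3) and will somewhat overstate power for small n. The function name is hypothetical, and `delta` denotes the effect size.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def approx_power(n, alpha, delta, a1=0.5, a2=0.5):
    """Normal approximation to the one-sided comparisonwise power:
    1 - beta ~= 1 - Phi(z_alpha - delta * sqrt(n * a1 * a2))."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha)  # upper alpha point
    shift = delta * sqrt(n * a1 * a2)           # approximate noncentrality
    return 1.0 - _STD_NORMAL.cdf(z_alpha - shift)


# Power grows with n for a fixed effect size and significance level.
for n in (40, 80, 120):
    print(n, round(approx_power(n, alpha=0.001, delta=1.0), 3))
```

With an effect size of zero the approximation returns the significance level itself, which is a convenient sanity check.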

Next, we consider a pair of DE genes, genes $i$ and $j$ say, with a correlation coefficient, $\rho$, in gene expression. We assume that these genes have a common effect size $\delta$ on the phenotype, which implies a common comparisonwise power $1 - \beta$ from (2.3). For gene $i$, let $S_i$ be the rejection status in multiple testing, so that $S_i = 1$ if gene $i$ is rejected and $S_i = 0$ otherwise. $S_j$ is similarly defined for gene $j$. Note that $E(S_i) = E(S_j) = 1 - \beta$ and $\mathrm{Var}(S_i) = \mathrm{Var}(S_j) = \beta(1 - \beta)$. The correlation between $S_i$ and $S_j$ is given by

$$\begin{aligned}
\theta &= \mathrm{Corr}(S_i, S_j) \\
&= \frac{E(S_i S_j) - E(S_i)E(S_j)}{\sqrt{\mathrm{Var}(S_i)\mathrm{Var}(S_j)}} \\
&= \frac{F_{n-2}(t_{n-2,\beta}, t_{n-2,\beta}; \psi) - (1 - \beta)^2}{\beta(1 - \beta)} \\
&\approx \frac{F_{n-2}(t_{n-2,\beta}, t_{n-2,\beta}; \rho) - (1 - \beta)^2}{\beta(1 - \beta)},
\end{aligned} \tag{2.4}$$

where $F_f$ is a cumulative distribution function of a bivariate t-distribution with $f$ degrees of freedom under the specified correlation coefficient between the 2 t-distributed variables. Note that the last line in (2.4) holds because the correlation coefficient between the t-statistics for genes $i$ and $j$, $\psi = \mathrm{Corr}(T_i, T_j)$, converges to $\rho$ for large sample sizes (Jung and others, 2005). Thus, the parameter $\theta$ in (2.4) can be regarded as a function of $\rho$, $\beta$, and $n$, which implies a function of $\rho$, $\delta$, $\alpha$, and $n$ from (2.3). Note also that the expression in (2.4) is similar to equation 5 in Shao and Tseng (2007). They consider one-sided z-tests with normal approximation for large sample sizes, although they also consider an adaptation to two-sided t-test.
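The correlation between rejection indicators in (2.4) can be explored numerically. The sketch below is a minimal illustration under a stated simplification: the bivariate t-distribution is replaced by a bivariate normal distribution (in the spirit of the z-test approximation mentioned above), and the joint probability is evaluated for a nonnegative correlation through a one-factor representation with trapezoidal integration. The function name, the integration grid, and the restriction to 0 <= rho < 1 are assumptions of this example.

```python
from math import asin, pi, sqrt
from statistics import NormalDist

_STD = NormalDist()


def rejection_correlation(beta, rho, steps=4000, width=8.0):
    """Normal-approximation analogue of theta in (2.4) for 0 <= rho < 1."""
    if not 0.0 < beta < 1.0 or not 0.0 <= rho < 1.0:
        raise ValueError("need 0 < beta < 1 and 0 <= rho < 1")
    h = _STD.inv_cdf(1.0 - beta)  # upper beta point of the standard normal
    # One-factor form: X_i = sqrt(rho) * W + sqrt(1 - rho) * e_i, so that
    # Pr(X_i <= h, X_j <= h) = E_W[ Phi((h - sqrt(rho) W) / sqrt(1 - rho))^2 ].
    dw = 2.0 * width / steps
    joint = 0.0
    for k in range(steps + 1):
        w = -width + k * dw
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoidal rule
        cond = _STD.cdf((h - sqrt(rho) * w) / sqrt(1.0 - rho))
        joint += weight * _STD.pdf(w) * cond * cond * dw
    return (joint - (1.0 - beta) ** 2) / (beta * (1.0 - beta))


# Check against the closed form 2 * asin(rho) / pi, valid when beta = 0.5.
print(rejection_correlation(0.5, 0.4), 2.0 * asin(0.4) / pi)
```

Under this approximation the correlation between rejection indicators is smaller than the correlation in gene expression, which is why a dedicated quantity is needed rather than reusing the expression-level correlation directly.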

Here, we assume that there is a single gene block that is formed by $m^*$ DE genes with a common correlation $\rho$ in gene expression and common effect size, $\delta$. Again, let $S$ be the rejection status in multiple testing for a particular gene in this gene block, so that $S = 1$ if the gene is rejected and $S = 0$ otherwise. Once again, $E(S) = 1 - \beta$ and the correlation of $S$ between genes is $\theta$ in (2.4) for all the genes in the gene block. As the distribution of the sum of rejection status, $U$, for all $m^*$ genes, we assume a beta-binomial distribution,

$$g(U = u) = \binom{m^*}{u} \frac{B(u + a, m^* - u + b)}{B(a, b)}, \tag{2.5}$$

where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ using the gamma function $\Gamma$, $a = (1 - \beta)(1/\theta - 1)$ ($> 0$) and $b = \beta(1/\theta - 1)$ ($> 0$) (Tsai and others, 2003, 2005; Shao and Tseng, 2007).

This is to assume that $U$ has a binomial distribution $\mathrm{Bin}(m^*, p)$ for a block-specific mean $p$, which is a random variable from the beta distribution with density function $p^{a-1}(1 - p)^{b-1}/B(a, b)$, whose mean is $\pi = E(S) = 1 - \beta$ (e.g. Williams, 1975). Although this assumption will restrict the correlation parameter $\rho$ to be positive within each gene block, it would be reasonable for one-sided tests in detecting DE genes that change in the same direction (see Section 5). The distribution $g(u)$ is a function of $m^*$, $\beta$, and $\theta$, thus a function of $m^*$, $\delta$, $\rho$, $\alpha$, and $n$ from (2.3) and (2.4). So, we denote $g(u)$ by $g(u \mid m^*, \delta, \rho, \alpha, n)$.
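The beta-binomial distribution in (2.5) is straightforward to evaluate on the log scale. The following sketch takes the block size, the comparisonwise type II error rate `beta`, and the rejection correlation `theta` as inputs (how they are obtained is left aside here) and returns the probabilities for u = 0, ..., m*. The function names and the example values are hypothetical.

```python
from math import exp, lgamma


def log_beta_fn(x, y):
    """log B(x, y) = log Gamma(x) + log Gamma(y) - log Gamma(x + y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(m_star, beta, theta):
    """Probabilities g(U = u), u = 0..m_star, as in (2.5)."""
    if not (0.0 < beta < 1.0 and 0.0 < theta < 1.0):
        raise ValueError("need 0 < beta < 1 and 0 < theta < 1")
    a = (1.0 - beta) * (1.0 / theta - 1.0)
    b = beta * (1.0 / theta - 1.0)
    pmf = []
    for u in range(m_star + 1):
        log_choose = lgamma(m_star + 1) - lgamma(u + 1) - lgamma(m_star - u + 1)
        log_p = log_choose + log_beta_fn(u + a, m_star - u + b) - log_beta_fn(a, b)
        pmf.append(exp(log_p))
    return pmf


# Hypothetical block: 50 genes, power 0.8, rejection correlation 0.3.
pmf = beta_binomial_pmf(50, beta=0.2, theta=0.3)
mean = sum(u * p for u, p in enumerate(pmf))
print(round(sum(pmf), 6), round(mean, 6))
```

The mean of this distribution is m*(1 − β) and its variance is m*β(1 − β){1 + (m* − 1)θ}, so θ acts as an overdispersion parameter relative to the binomial case.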

Finally, we assume that there are several different gene blocks. Specifically, we assume that $m_1$ DE genes can be divided into $L$ gene blocks. We assume that the $L$ gene blocks are mutually independent, which yields a blockwise correlation structure for the $m_1$ DE genes. This structure is frequently used to model co-regulated genes in genetic pathways and/or homologous genes (e.g. Shao and Tseng, 2007; Storey and Tibshirani, 2003). We assume that gene block $l$ comprises $m_1^{(l)}$ genes with a common correlation coefficient $\rho_1^{(l)}$ in gene expression and common effect size $\delta^{(l)}$. We obtain $1 - \beta^{(l)}$ from (2.3) and $\theta^{(l)}$ from (2.4) ($l = 1, \ldots, L$). From (2.5), the sum of rejection status for all $m_1^{(l)}$ genes, $U^{(l)}$, will follow the beta-binomial distribution,

$$g^{(l)}(U^{(l)} = u \mid m_1^{(l)}, \delta^{(l)}, \rho_1^{(l)}, \alpha, n) = \binom{m_1^{(l)}}{u} \frac{B(u + a^{(l)}, m_1^{(l)} - u + b^{(l)})}{B(a^{(l)}, b^{(l)})}, \tag{2.6}$$

where $a^{(l)} = (1 - \beta^{(l)})(1/\theta^{(l)} - 1)$ and $b^{(l)} = \beta^{(l)}(1/\theta^{(l)} - 1)$. Then, the distribution of the sum of true positives across the $L$ blocks, $R_1 = \sum_{l=1}^{L} U^{(l)}$, in Table 1 follows the convolution of the beta-binomial distributions (2.6),

$$f_1(R_1 = r \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n) = \sum_{(k_1, \ldots, k_L) \in K(r)} \prod_{l=1}^{L} g^{(l)}(U^{(l)} = k_l \mid m_1^{(l)}, \delta^{(l)}, \rho_1^{(l)}, \alpha, n), \tag{2.7}$$

where $\tilde{m}_1 = (m_1^{(1)}, \ldots, m_1^{(L)})$, $\tilde{\delta} = (\delta^{(1)}, \ldots, \delta^{(L)})$, and $\tilde{\rho}_1 = (\rho_1^{(1)}, \ldots, \rho_1^{(L)})$. Here, the product pertains to the assumption that gene blocks are independent, and $K(r)$ in the summation is the set of all possible combinations of nonnegative integers $(k_1, \ldots, k_L)$ that satisfy $\sum_{l=1}^{L} k_l = r$.
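The sum over K(r) in (2.7) does not need to be enumerated explicitly: because the blocks are independent, the same distribution is obtained by convolving the block-level distributions one block at a time. The sketch below illustrates this with a hypothetical helper that accepts any list of probability vectors; in practice each vector would be a block-level distribution such as (2.6).

```python
def convolve_pmfs(block_pmfs):
    """Distribution of the sum of independent counts.

    block_pmfs[l][k] is Pr(U^(l) = k); the result r-th entry is Pr(R1 = r)."""
    total = [1.0]  # distribution of an empty sum: Pr(0) = 1
    for pmf in block_pmfs:
        updated = [0.0] * (len(total) + len(pmf) - 1)
        for i, p in enumerate(total):
            for k, q in enumerate(pmf):
                updated[i + k] += p * q
        total = updated
    return total


# Toy check: two independent single-gene blocks, each rejected with
# probability 0.5, give a Binomial(2, 0.5) total.
print(convolve_pmfs([[0.5, 0.5], [0.5, 0.5]]))  # [0.25, 0.5, 0.25]
```

The cost of this sequential approach is quadratic in the number of DE genes, whereas direct enumeration of K(r) grows combinatorially with the number of blocks.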

For non-DE genes, we can also assume similar blockwise correlation models. However, because the number of non-DE genes, $m_0$, is expected to be very large compared with the number of DE genes, $m_1$, it may be rather difficult to specify plausible values for the number of blocks, block sizes, and correlation coefficients within each gene block. Meanwhile, it is generally expected that a large fraction of non-DE genes are independent or weakly correlated. Hence, as an approximation, we propose to assume independence for non-DE genes, that is $\rho_0 = 0$. Under this assumption, the distribution of the number of false positives, $R_0$, a sum of independent binary outcomes, can be modeled by the binomial distribution,

$$f_0(R_0 = r \mid m_0, \alpha) = \binom{m_0}{r} \alpha^r (1 - \alpha)^{m_0 - r}. \tag{2.8}$$

For this simple model, the distribution of $R_0$ depends on the parameters $m_0$ and $\alpha$. We assume statistical independence between DE genes and non-DE genes.

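Now we describe the distribution of FDP in (2.1) and Se in (2.2) using the derived distributions (2.7) for $R_1$, and (2.8) for $R_0$. The distribution of FDP depends on the parameters $(m_0, \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n)$, such that

$$f(\mathrm{FDP} = v \mid m_0, \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n) = \sum_{(r_0, r_1) \in X(v)} f_0(R_0 = r_0 \mid m_0, \alpha)\, f_1(R_1 = r_1 \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n), \tag{2.9}$$

where $X(v)$ is the set of all possible combinations of $(r_0, r_1)$ that satisfy $r_0/(r_0 + r_1) = v$. The expectation of FDP, that is FDR, is calculated as follows:

$$\mathrm{FDR} = \sum_{r_0=0}^{m_0} \sum_{r_1=0}^{m_1} \frac{r_0}{r_0 + r_1}\, f_0(R_0 = r_0 \mid m_0, \alpha)\, f_1(R_1 = r_1 \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n), \tag{2.10}$$

where $r_0/(r_0 + r_1) = 0$ when $r_0 = r_1 = 0$. Meanwhile, the distribution of Se is obtained as that of $R_1$, which depends on the parameters $(\tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n)$, such that

$$f(\mathrm{Se} = v \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n) = f_1(R_1 = m_1 v \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n). \tag{2.11}$$

The expectation of Se is calculated as follows:

$$E(\mathrm{Se}) = \frac{1}{m_1} \sum_{r_1=0}^{m_1} r_1\, f_1(R_1 = r_1 \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n). \tag{2.12}$$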
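Given the two distributions of false and true positives, the quantities in (2.9) to (2.12) reduce to finite sums. The sketch below is a direct, unoptimized illustration: it takes the two probability vectors as plain lists (index = count) and returns the tail probability of FDP together with FDR, and the tail probability of Se together with its mean. The function names, the thresholds `d1` and `d2`, and the toy inputs are hypothetical.

```python
def fdp_tail_and_fdr(f0, f1, d1):
    """Return (Pr(FDP <= d1), FDR) from f0[r0] = Pr(R0 = r0), f1[r1] = Pr(R1 = r1)."""
    prob_below = 0.0
    fdr = 0.0
    for r0, p0 in enumerate(f0):
        for r1, p1 in enumerate(f1):
            # FDP is taken to be 0 when there are no rejections.
            fdp = r0 / (r0 + r1) if r0 + r1 > 0 else 0.0
            weight = p0 * p1  # independence of DE and non-DE genes
            fdr += fdp * weight
            if fdp <= d1 + 1e-12:
                prob_below += weight
    return prob_below, fdr


def sensitivity_tail_and_mean(f1, d2):
    """Return (Pr(Se >= d2), E(Se)) where m1 = len(f1) - 1."""
    m1 = len(f1) - 1
    tail = sum(p for r1, p in enumerate(f1) if r1 >= d2 * m1 - 1e-12)
    mean = sum(r1 * p for r1, p in enumerate(f1)) / m1
    return tail, mean


# Toy inputs: one non-DE gene rejected with probability 0.5 and
# two independent DE genes, each rejected with probability 0.5.
print(fdp_tail_and_fdr([0.5, 0.5], [0.25, 0.5, 0.25], d1=0.4))
print(sensitivity_tail_and_mean([0.25, 0.5, 0.25], d2=0.5))
```

The double loop is quadratic in the number of genes; for realistic values of the number of non-DE genes it is preferable to work with the cumulative distribution of the false positives, since FDP ≤ d1 is equivalent to r0 ≤ d1·r1/(1 − d1).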

The proposed procedure for sample size calculation imposes the following 2 conditions on the distributions of FDP in (2.9) and Se in (2.11), respectively:

$$\Pr(\mathrm{FDP} \le d_1 \mid m_0, \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n) \ge c_1, \tag{2.13}$$

$$\Pr(\mathrm{Se} \ge d_2 \mid \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1, \alpha, n) \ge c_2. \tag{2.14}$$

Specifically, we can solve these conditions for $n$ and $\alpha$ for given parameters $(m_0, \tilde{m}_1, \tilde{\delta}, \tilde{\rho}_1)$, where $d_1$, $d_2$, $c_1$, and $c_2$ are prespecified bounds. Shao and Tseng (2007) called the probability in (2.14) overall power. An algorithm to obtain the solution for $n$ and $\alpha$ is provided in section A of the supplementary material available at Biostatistics online (http://biostatistics.oxfordjournals.org).
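The algorithm in the supplementary material is not reproduced here. Purely as an illustration of what solving (2.13) and (2.14) for the sample size and the significance level can look like, the sketch below performs a naive grid search under strong simplifying assumptions that are not part of the proposed procedure: DE genes are treated as independent with a common effect size (so the number of true positives is binomial), power is computed by a normal approximation rather than the noncentral t-distribution, and the significance level is restricted to a user-supplied grid. All function names are hypothetical.

```python
from math import exp, floor, lgamma, log, sqrt
from statistics import NormalDist

_STD = NormalDist()


def binom_pmf(m, p, k_max=None):
    """Binomial(m, p) probabilities for k = 0..k_max, on the log scale."""
    k_max = m if k_max is None else min(k_max, m)
    out = []
    for k in range(k_max + 1):
        log_choose = lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
        out.append(exp(log_choose + k * log(p) + (m - k) * log(1.0 - p)))
    return out


def criteria_met(n, alpha, m0, m1, delta, d1, c1, d2, c2, a1=0.5, a2=0.5):
    """Check Pr(FDP <= d1) >= c1 and Pr(Se >= d2) >= c2 (independent genes)."""
    power = 1.0 - _STD.cdf(_STD.inv_cdf(1.0 - alpha) - delta * sqrt(n * a1 * a2))
    power = min(max(power, 1e-12), 1.0 - 1e-12)  # keep logs finite
    f1 = binom_pmf(m1, power)
    # FDP <= d1 is equivalent to r0 <= d1 * r1 / (1 - d1).
    f0 = binom_pmf(m0, alpha, floor(d1 * m1 / (1.0 - d1) + 1e-9))
    cdf0, running = [], 0.0
    for p in f0:
        running += p
        cdf0.append(running)
    pr_fdp = 0.0
    for r1, p1 in enumerate(f1):
        k = min(floor(d1 * r1 / (1.0 - d1) + 1e-9), len(cdf0) - 1)
        pr_fdp += p1 * cdf0[k]
    pr_se = sum(p1 for r1, p1 in enumerate(f1) if r1 >= d2 * m1 - 1e-9)
    return pr_fdp >= c1 and pr_se >= c2


def smallest_sample_size(m0, m1, delta, d1, c1, d2, c2, alphas, n_max=1000):
    """Smallest n (with an alpha from the grid) meeting both criteria, or None."""
    for n in range(4, n_max + 1):
        for alpha in alphas:
            if criteria_met(n, alpha, m0, m1, delta, d1, c1, d2, c2):
                return n, alpha
    return None


alpha_grid = [k * 1e-4 for k in range(1, 51)]  # 0.0001, ..., 0.0050
result = smallest_sample_size(1950, 50, 1.0, 0.1, 0.95, 0.8, 0.9, alpha_grid)
print(result)
```

The example configuration mirrors the independent case with 2000 genes, 50 DE genes, and a unit effect size. Because of the normal approximation and the coarse grid, the value returned by this sketch need not coincide with estimates obtained from the exact procedure.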


When both selections, that is selection of overexpressed genes and selection of underexpressed genes for a particular phenotypic class, are planned at the analysis stage, we calculate sample size based on one-sided multiple tests separately for each selection and take the maximum of the calculated sample sizes, which may generally yield a conservative design for one of the 2 selections. Note that for the same specification of design parameters with the same effect sizes, the sample size estimates will be identical for both selections. See Section 5 for a discussion of basing the calculation on two-sided tests.

2.3 Specification of design parameters

Relevant data from a pilot study or earlier experiments for a similar population of samples are essential for specifying gene blocks and plausible values of design parameters. We can estimate the number of non-DE genes, $m_0$, using relevant data (e.g. Schweder and Spjøtvoll, 1982; Benjamini and Hochberg, 2000; Storey and Tibshirani, 2003). A simple procedure to identify gene blocks is the application of hierarchical clustering of genes. The gene set used for clustering can be restricted to genes associated with classes (e.g. the top $m - \hat{m}_0$ genes with the greatest t-statistics) to identify gene blocks for DE genes. For a given cutoff on the specified distance metric in hierarchical clustering, we can identify gene blocks and obtain the values for the number of blocks $L$ and the sizes of each block $\tilde{m}_1$. One very informal procedure to identify clusters, which is often used, is simply to examine the dendrogram, looking for large changes in level. More formal procedures for the number of clusters problem are available (see Milligan and Cooper, 1985, for an extensive review).

The average correlation coefficients of gene expression within each gene block can be used as estimates for the common correlation coefficients in gene expression within each block, $\tilde{\rho}_1$ (Tsai and others, 2005). Similarly, average effect sizes within each block can be used as estimates for the common effect sizes within each block, $\tilde{\delta}$. It is advisable to consider different cutoffs in hierarchical clustering and assess the impact of changing the cutoff on sample size estimates, as illustrated in Section 4.
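The block identification step can be sketched with a deliberately naive complete-linkage clustering on the distance one minus correlation, followed by averaging the correlations within each resulting block. The code assumes that a gene-by-gene correlation matrix has already been computed from historical data; the function names, the toy matrix, and the cutoff are hypothetical, and the quartic-time implementation is meant only for small illustrations.

```python
def complete_linkage_blocks(corr, cutoff):
    """Merge genes agglomeratively (complete linkage, distance = 1 - corr)
    until the closest pair of clusters is farther apart than `cutoff`."""
    clusters = [[i] for i in range(len(corr))]

    def linkage(a, b):
        # Complete linkage: the largest pairwise distance between clusters.
        return max(1.0 - corr[i][j] for i in a for j in b)

    while len(clusters) > 1:
        candidates = [
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        dist, x, y = min(candidates)
        if dist > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def average_within_block_correlation(corr, block):
    """Mean pairwise correlation within a block (None for a single gene)."""
    pairs = [(i, j) for pos, i in enumerate(block) for j in block[pos + 1:]]
    if not pairs:
        return None
    return sum(corr[i][j] for i, j in pairs) / len(pairs)


# Toy correlation matrix with two obvious blocks: genes {0, 1} and {2, 3}.
toy_corr = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
blocks = complete_linkage_blocks(toy_corr, cutoff=0.5)
print(blocks)  # [[0, 1], [2, 3]]
print([average_within_block_correlation(toy_corr, b) for b in blocks])
```

Rerunning the same code with several cutoffs gives the different block configurations whose impact on the sample size estimate can then be compared.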

In specifying effect sizes, one should consider a possible difference in the precision of microarray data between the earlier experiments and the current experiment one is planning. For given samples, the difference can be caused by possible differences between earlier experiments and the current experiment in microarray platforms, experimental procedures, and conditions, such as those in sample preparation, RNA labeling, and hybridization. If one can specify the ratio of the standard deviation of expression data in the current experiment to that in an earlier experiment for given samples, $\gamma$ say, one may multiply the estimated effect sizes from the previous experiment by $1/\gamma$, as the estimated effect sizes for sample size calculation for the current experiment.

3. NUMERICAL RESULTS

We illustrate the proposed sample size calculation based on criteria (2.13) and (2.14), with comparisons with those based on the expectation of FDP, that is FDR, in (2.10) and that of Se in (2.12),

$$\mathrm{FDR} \le d_1, \tag{3.1}$$

$$E(\mathrm{Se}) \ge d_2. \tag{3.2}$$

Specifically, we consider the following 3 criteria:

Criteria 1: $\Pr(\mathrm{FDP} \le 0.1) \ge 0.95$ and $\Pr(\mathrm{Se} \ge 0.8) \ge 0.9$, as (2.13) and (2.14)

Criteria 2: $\mathrm{FDR} \le 0.1$ and $\Pr(\mathrm{Se} \ge 0.8) \ge 0.9$, as (3.1) and (2.14)

Criteria 3: $\mathrm{FDR} \le 0.1$ and $E(\mathrm{Se}) \ge 0.8$, as (3.1) and (3.2).


Criteria 1 correspond to the proposed sample size calculation to control actual FDP and Se. Criteria 2 are similar to those used by Shao and Tseng (2007) and Criteria 3 are similar to those used by Jung (2005).

For DE genes, we assumed blockwise correlation structures with $L = 1$, 2, or 5 blocks with equal sizes. We assumed a common correlation coefficient $\rho_1^{(l)} = \rho$ and a common effect size $\delta^{(l)} = \delta$ across blocks ($l = 1, \ldots, L$). Table 2 summarizes the sample size estimates for each of several configurations of the parameters under equal sample sizes between 2 classes ($a_1 = a_2 = 1/2$). As one expects, sample size estimates based on Criteria 1 were greater than those based on the other criteria, which rely on the expectation of FDP. It is quite surprising that the increment by controlling actual FDP compared with controlling FDR (i.e. Criteria 1 versus Criteria 2 in Table 2) was at most 14% for sample sizes of 50 or more. A simulation study demonstrated that sample size estimates based on Criteria 1 certainly could control Criteria 1, while sample size estimates based on Criteria 2 and 3 yielded underpowered designs that could not control Criteria 1 (see section B of the supplementary material available at Biostatistics online [http://biostatistics.oxfordjournals.org]).

Table 2. Sample size estimates for the 3 criteria, regarding false positives and true positives

| $m_1$ | $\delta$ | $\rho$ | $L$ | $m = 2000$: Criteria 1† | $m = 2000$: Criteria 2‡ | $m = 2000$: Criteria 3§ | $m = 5000$: Criteria 1† | $m = 5000$: Criteria 2‡ | $m = 5000$: Criteria 3§ |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.58 | 0.0 | 1 | 204 | 179 | 162 | 232 | 207 | 188 |
| 50 | 0.58 | 0.4 | 2 | 235 | 208 | 163 | 264 | 238 | 189 |
| 50 | 0.58 | 0.4 | 5 | 220 | 194 | 162 | 248 | 223 | 188 |
| 50 | 0.58 | 0.7 | 2 | 252 | 225 | 164 | 282 | 255 | 190 |
| 50 | 0.58 | 0.7 | 5 | 230 | 205 | 162 | 260 | 233 | 189 |
| 50 | 1.00 | 0.0 | 1 | 73 | 64 | 58 | 83 | 74 | 68 |
| 50 | 1.00 | 0.4 | 2 | 84 | 75 | 59 | 95 | 85 | 68 |
| 50 | 1.00 | 0.4 | 5 | 79 | 70 | 58 | 89 | 80 | 68 |
| 50 | 1.00 | 0.7 | 2 | 89 | 79 | 59 | 100 | 90 | 68 |
| 50 | 1.00 | 0.7 | 5 | 83 | 73 | 58 | 93 | 84 | 68 |
| 50 | 2.00 | 0.0 | 1 | 22 | 19 | 18 | 25 | 22 | 21 |
| 50 | 2.00 | 0.4 | 2 | 25 | 23 | 18 | 28 | 26 | 21 |
| 50 | 2.00 | 0.4 | 5 | 24 | 20 | 18 | 27 | 24 | 21 |
| 50 | 2.00 | 0.7 | 2 | 26 | 24 | 18 | 30 | 27 | 21 |
| 50 | 2.00 | 0.7 | 5 | 24 | 22 | 18 | 27 | 25 | 21 |
| 100 | 0.58 | 0.0 | 1 | 171 | 154 | 142 | 200 | 182 | 169 |
| 100 | 0.58 | 0.4 | 2 | 204 | 185 | 143 | 235 | 216 | 169 |
| 100 | 0.58 | 0.4 | 5 | 189 | 171 | 142 | 219 | 201 | 169 |
| 100 | 0.58 | 0.7 | 2 | 222 | 202 | 144 | 252 | 234 | 171 |
| 100 | 0.58 | 0.7 | 5 | 200 | 182 | 142 | 230 | 212 | 169 |
| 100 | 1.00 | 0.0 | 1 | 61 | 55 | 51 | 71 | 65 | 61 |
| 100 | 1.00 | 0.4 | 2 | 73 | 66 | 51 | 84 | 77 | 61 |
| 100 | 1.00 | 0.4 | 5 | 68 | 62 | 51 | 79 | 72 | 61 |
| 100 | 1.00 | 0.7 | 2 | 78 | 72 | 52 | 89 | 82 | 61 |
| 100 | 1.00 | 0.7 | 5 | 72 | 64 | 51 | 83 | 76 | 61 |
| 100 | 2.00 | 0.0 | 1 | 18 | 17 | 16 | 21 | 20 | 19 |
| 100 | 2.00 | 0.4 | 2 | 22 | 20 | 16 | 25 | 23 | 19 |
| 100 | 2.00 | 0.4 | 5 | 20 | 18 | 16 | 24 | 22 | 19 |
| 100 | 2.00 | 0.7 | 2 | 23 | 21 | 16 | 27 | 24 | 19 |
| 100 | 2.00 | 0.7 | 5 | 21 | 19 | 16 | 24 | 23 | 19 |

† Criteria 1, $\Pr(\mathrm{FDP} \le 0.1) \ge 0.95$ and $\Pr(\mathrm{Se} \ge 0.8) \ge 0.9$, as criteria (2.13) and (2.14).

‡ Criteria 2, $\mathrm{FDR} \le 0.1$ and $\Pr(\mathrm{Se} \ge 0.8) \ge 0.9$, as criteria (3.1) and (2.14).

§ Criteria 3, $\mathrm{FDR} \le 0.1$ and $E(\mathrm{Se}) \ge 0.8$, as criteria (3.1) and (3.2).

Fig. 1. Contour plots show how the sample size estimates change for various lower bounds $c_1$ and $c_2$ in the criteria (2.13) and (2.14) (left panel) and for various numbers of blocks $L$ and common correlation coefficients $\rho_1$ for DE genes within gene block (right panel). We set $m = 2000$, $m_1 = 50$ and $\delta = 1.0$ for both panels. We set $\rho_1 = 0.4$ and $L = 5$ for the left panel, while $c_1 = 0.95$ and $c_2 = 0.9$ for the right panel.

Figure 1 provides 2 graphs to illustrate how sample size estimates change, depending on design parameters. As expected, the sample size estimates increase for larger (i.e. more stringent) thresholds $c_1$ and $c_2$ in (2.13) and (2.14) (the left panel of Figure 1). The sample size estimates decrease for smaller correlation, $\rho_1$, between DE genes within a gene block, or for larger number of gene blocks, $L$ (the right panel of Figure 1). Note that smaller $\rho_1$ or larger $L$ represents the convergence of the correlation matrix for all the genes from all blocks to the identity matrix.

4. LYMPHOMA EXAMPLE

We illustrate the proposed sample size calculation with the estimation of design parameters, as described in Section 2.3, using gene expression data from a lymphoma study of Rosenwald and others (2002). They correlated gene expression data on 7399 genes from cDNA microarrays using pretreatment biopsy specimens with survival time after chemotherapy for 240 diffuse large B-cell lymphoma patients. We supposed the situation where we design a similar microarray experiment for a similar population of patients using the data set from this earlier experiment. We considered a comparison of 2 classes: 2-year survival and 2-year death. We supposed that similar numbers of genes are investigated in the current microarray experiment we are planning and set the number of genes to be 7399 ($m = 7399$). Here, we report sample size calculations for selecting overexpressed genes for 2-year death. We estimated the proportion of the number of DE genes using the procedure of Storey and Tibshirani (2003). The estimated proportion of DE genes was 0.2; therefore, the estimated number of DE genes was 1480 ($= m \times 0.2$). Note that DE genes include both overexpressed and underexpressed genes for 2-year death. We assumed that the number of overexpressed genes and that of underexpressed genes are nearly identical and set $m_1$ to be 100, 400, or 800 to cover 740 ($= 1480/2$). We assumed blockwise correlation structures for the $m_1$ DE genes. To identify gene blocks using the data set, we performed a hierarchical clustering of the top $m_1$ genes with the most significance in the one-sided multiple t-tests using the distance metric of one minus Spearman correlation coefficient and complete linkage. We identified 2 or 5 clusters with different sizes as blocks (i.e. $L = 2$ or 5) for 2 different cutoff points in the distance metric, as summarized in Table 3. We calculated the average correlation coefficients and the average of standardized mean differences between classes within block as estimates of a common correlation coefficient and a common effect size within block, respectively. We also used the parameter $\gamma$, introduced in Section 2.3, which represents the ratio of the standard deviation of expression data in the current experiment to that in the earlier experiment of Rosenwald and others (2002) for given samples. The effect sizes for the current experiment were specified to be the estimated effect sizes from the earlier experiment, multiplied by $1/\gamma$.

Table 3. Sample size estimates to control $\Pr(\mathrm{FDP} \le 0.1) \ge 0.95$ and $\Pr(\mathrm{Se} \ge 0.8) \ge 0.9$, as criteria (2.13) and (2.14), for several configurations of design parameters for the lymphoma example

| $m_1$ | $L$ | Block no. | Block size | Average correlation | Average effect size | Sample size estimate ($\gamma = 1.00$) | Sample size estimate ($\gamma = 0.67$) |
|---|---|---|---|---|---|---|---|
| 100 | 2 | 1 | 15 | 0.15 | 0.46 | 460 | 207 |
| | | 2 | 85 | 0.41 | 0.43 | | |
| 100 | 5 | 1 | 7 | 0.23 | 0.47 | 445 | 202 |
| | | 2 | 59 | 0.43 | 0.43 | | |
| | | 3 | 26 | 0.47 | 0.44 | | |
| | | 4 | 6 | 0.22 | 0.44 | | |
| | | 5 | 2 | 0.30 | 0.44 | | |
| 400 | 2 | 1 | 383 | 0.30 | 0.36 | 499 | 225 |
| | | 2 | 17 | 0.19 | 0.42 | | |
| 400 | 5 | 1 | 47 | 0.17 | 0.37 | 470 | 212 |
| | | 2 | 109 | 0.35 | 0.37 | | |
| | | 3 | 176 | 0.49 | 0.36 | | |
| | | 4 | 51 | 0.21 | 0.36 | | |
| | | 5 | 17 | 0.19 | 0.42 | | |
| 800 | 2 | 1 | 612 | 0.31 | 0.32 | 540 | 242 |
| | | 2 | 188 | 0.13 | 0.32 | | |
| 800 | 5 | 1 | 119 | 0.21 | 0.31 | 532 | 238 |
| | | 2 | 493 | 0.39 | 0.32 | | |
| | | 3 | 125 | 0.21 | 0.31 | | |
| | | 4 | 38 | 0.17 | 0.35 | | |
| | | 5 | 25 | 0.10 | 0.34 | | |

Sample size estimates for several configurations are given in Table 3. The impact of changing the number of gene blocks was generally small. For $\gamma = 1.0$, 445 or more samples are needed, which is probably impractical. On the other hand, for $\gamma = 0.67$ ($= 1.0/1.5$), the required sample size was reduced to 242 or less. These results indicate that a substantial reduction in the variability of expression data (e.g. a standard deviation reduced to 67% of that in the earlier experiment, or less) is warranted to have realistically acceptable sample sizes for the current microarray experiment. A simulation study demonstrated the adequacy of assuming the gene block structure for DE genes (see section B of the supplementary material available at Biostatistics online [http://biostatistics.oxfordjournals.org]).
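The effect of the precision ratio can be checked with simple arithmetic. Effect sizes enter the comparisonwise power only through the product of the effect size and the square root of the sample size, so dividing every effect size by the ratio is roughly offset by multiplying the sample size by the squared ratio. The sketch below rescales two effect sizes from Table 3 and compares this rough rule with the sample sizes reported there. The function names are hypothetical, and the rule is a back-of-the-envelope approximation, not the proposed calculation.

```python
def rescale_effect_sizes(effect_sizes, gamma):
    """Multiply effect sizes from the earlier experiment by 1 / gamma."""
    return [d / gamma for d in effect_sizes]


def approx_rescaled_sample_size(n_reference, gamma):
    """Rough rule: the required n scales with gamma squared."""
    return n_reference * gamma ** 2


gamma = 0.67
print([round(d, 2) for d in rescale_effect_sizes([0.46, 0.43], gamma)])

# (sample size for gamma = 1.00, sample size for gamma = 0.67) from Table 3
table3_pairs = [(460, 207), (445, 202), (499, 225), (470, 212), (540, 242), (532, 238)]
for n_ref, n_reported in table3_pairs:
    print(n_ref, n_reported, round(approx_rescaled_sample_size(n_ref, gamma), 1))
```

The rough rule lands within a few samples of each reported estimate, which is consistent with the precision ratio acting almost entirely through the noncentrality parameter.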


5. DISCUSSION

Nowadays, genome-wide microarrays are widely used in biological and clinical research. More attention should be paid to the design of microarray experiments including sample size calculations as well as to the analysis of microarray data. Almost all previous procedures for sample size calculation to detect differential genes are to control the expectation of FDP, that is FDR, because the control of FDR is so common in the analysis of microarray data. To cope with possible variability in FDP and the number of true positives or sensitivity Se, we developed a procedure for sample size calculation to control the distributions of FDP and Se simultaneously under blockwise correlation structures among genes. The increment in required sample size by controlling actual FDP compared with controlling FDR ranged from 10 to 20% under the settings we considered in Section 3, which would be acceptable in many cases. Because the block sizes, correlation coefficients, and effect sizes within each block can vary across blocks, our procedure would be generally applicable for controlling the actual numbers of false positives and true positives in multiple testing.

Although we can adapt our sample size calculation to two-sided multiple tests for selecting both overexpressed and underexpressed genes, we propose basing it on one-sided tests. One-sided tests are intended to detect DE genes that change in the same direction. As Matsui and others (2008) argued, when restricting to DE genes that change in the same direction, the genes would tend to have positive correlations. For the lymphoma data set in Section 4, the observed correlation coefficients among the top 100 genes with the most significance in the one-sided multiple tests ranged from −0.27 to 0.98, with a median of 0.35; the proportion of negative correlations was only 5%. Note that although negative correlations are almost as likely as positive correlations (because a group of genes may inhibit or switch off other genes in the same molecular pathway), many of such genes would change in the opposite direction, which can be detected by reversed one-sided tests. Without such restriction on the direction of differential expression, as intended by two-sided tests, it may be difficult to incorporate possible negative correlations among blocks of positively correlated DE genes in sample size calculations. On the other hand, by invoking the restriction on the direction of differential expression, as intended by the use of one-sided tests, we could expect several independent blocks of positively correlated DE genes to exist. Sensitivity analyses using different cutoffs in hierarchical clustering could capture a correct specification with independent blocks of positively correlated DE genes.

An interesting extension of our procedure is to incorporate the empirical null distribution introduced by Efron (2004). Specifically, we estimate the empirical null distribution using relevant data sets from pilot studies or earlier experiments and derive an empirical distribution for the number of false positives, $R_0$. The evaluation of this extension, including the comparison with the use of the theoretical distribution in (2.8), is a subject of future research.

ACKNOWLEDGMENTS

We are grateful to the anonymous reviewers, associate editor, and coeditor for helpful comments. Conflict of Interest: None declared.

FUNDING

Grant-in-Aid for Scientific Research (20590599) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.


REFERENCES

BENJAMINI, Y. AND HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289–300.

BENJAMINI, Y. AND HOCHBERG, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics 25, 60–83.

CHOW, S. C., SHAO, J. AND WANG, H. (2003). Sample Size Calculations in Clinical Research. New York: Marcel Dekker.

EFRON, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association 99, 96–104.

GENOVESE, C. R. AND WASSERMAN, L. (2004). A stochastic process approach to false discovery control. The Annals of Statistics 32, 1035–1061.

GENOVESE, C. R. AND WASSERMAN, L. (2006). Exceedance control of the false discovery proportion. Journal of the American Statistical Association 101, 1408–1417.

JUNG, S. H. (2005). Sample size for FDR-control in microarray data analysis. Bioinformatics 21, 3097–3104.

JUNG, S. H., BANG, H. AND YOUNG, S. (2005). Sample size calculation for multiple testing in microarray data analysis. Biostatistics 6, 157–169.

KORN, E. L., LI, M. C., MCSHANE, L. M. AND SIMON, R. (2007). An investigation of two multivariate permutation methods for controlling the false discovery proportion. Statistics in Medicine 26, 4428–4440.

KORN, E. L., TROENDLE, J. F., MCSHANE, L. M. AND SIMON, R. (2004). Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference 124, 379–398.

MATSUI, S., ZENG, S., YAMANAKA, T. AND SHAUGHNESSY, J. (2008). Sample size calculations based on ranking and selection in microarray experiments. Biometrics 64, 217–226.

MILLIGAN, G. W. AND COOPER, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179.

PAWITAN, Y., CALZA, S. AND PLONER, A. (2006). Estimation of false discovery proportion under general dependence. Bioinformatics 22, 3025–3031.

PAWITAN, Y., MICHIELS, S., KOSCIELNY, S., GUSNANTO, A. AND PLONER, A. (2005). False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21, 3017–3024.

ROSENWALD, A., WRIGHT, G., CHAN, W. C., CONNORS, J. M., CAMPO, E., FISHER, R. I., GASCOYNE, R. D., MULLER-HERMELINK, H. K., SMELAND, E. B., STAUDT, L. M. and others (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. The New England Journal of Medicine 346, 1937–1947.

SCHWEDER, T. AND SPJØTVOLL, E. (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika 69, 493–502.

SHAO, Y. AND TSENG, C. H. (2007). Sample size calculation with dependence adjustment for FDR-control in microarray studies. Statistics in Medicine 26, 4219–4237.

STOREY, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64, 479–498.

STOREY, J. D. AND TIBSHIRANI, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100, 9440–9445.

TSAI, C. A., HSUEH, H. M. AND CHEN, J. J. (2003). Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics 59, 1071–1081.

TSAI, C. A., WANG, S. J., CHEN, D. T. AND CHEN, J. J. (2005). Sample size for gene expression microarray experiments. Bioinformatics 21, 1502–1508.

WILLIAMS, D. A. (1975). The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 31, 949–952.

[Received November 20, 2008; revised May 20, 2009; accepted for publication June 22, 2009]