3.3 Results
4.2.2 Subnetwork testing
4.2.2.1 Scoring
The subnetwork scores in SNet (Soh et al., 2011) and PFSNet (Lim and Wong, 2014) are assigned to patients in each phenotype D and ¬D. For example, a disease-related subnetwork might have a mean value of 15 in phenotype D and 51 in ¬D. In datasets with large sample sizes, class-label permutations are used to test whether this difference is significant. In datasets with extremely small sample size, this is not possible, and one has to resort to the theoretical t-distribution for this test. But, in this situation, a small change in sample size can produce dramatically different outcomes. For example, a t-test between two groups with two samples each—say 10, 20 and 50, 52—has a p-value of 0.077, whereas with a few more samples 9,10,20,21 and 49,50,52,53, the p-value drops to 0.0008.
Our subnetwork score is based on a novel idea. We postulate that, when a subnetwork is irrelevant to the distinction between phenotypesD and ¬D, the difference of the expres- sion values of any gene in this subnetwork in any pair of samples of D and ¬D should be very small. Suppose there arek genes in a subnetwork, m patients in phenotype D and n patients in phenotype ¬D. Then there are m ∗ n possible pairs of differences for each of the k genes. According to the postulate, if the subnetwork is irrelevant, these M = k ∗ m ∗ n paired differences should be distributed around 0. Thus we propose to evaluate these paired differences using a paired t-test.
Letδ(gi, pj, p0l) =e(gi, pj) −e(gi, p0l) be the distribution of paired expression differences
for eachpj in D,p0lin ¬D and giin subnetworkS, where e(a, b) represents the expression
value of genea in patient b. Then the t-statistic computed for the subnetwork S is:
TS =
µδ
sdδ/
√
M (4.2)
where µδ and sdδ are the average and standard deviation over all paired differences of
genes in the subnetwork S respectively and M = k ∗ m ∗ n.
Returning to our example of using two samples per group, although we have only 2 samples per phenotype, we can have up to 4 ∗k paired differences if k is the size of the subnetwork.
It is also possible to define a similar distribution based on the rank differences of the genes in subnetworkS; i.e. δ0(g
i, pj, p0l) =r(gi, pj) −r(gi, p0l) wherer represents the rank
function defined in equation 4.1. The t-statistic computed for the paired rank difference is analogously defined to that of the paired expression differences in equation 4.2.
The choice of statistical test depends on the assumptions that govern the dataset. The t-test is a parametric test that assumes data normality. If the number of samples and the
size of subnetworks are very small, it is hard to verify this normality assumption. Hence it might be necessary to consider an alternative test statistic that does not assume data normality. In our software, we provide the option of specifying a Wilcoxon-like test. However, in our experiments, the two tests actually produce very similar results.
The Wilcoxon-like test statistic for subnetwork S is computed by
WS = k X i=1 n X j=1 m X l=1 sign(δ(gi, pj, p0l)).Ri,j,l (4.3)
where sign is a function that maps positive numbers to 1 and negative numbers to −1 and Ri,j,l is the rank of the absolute value of the differences over all genes in the subnetwork
S in increasing order.
The Wilcoxon-like test has a similar null hypothesis to the t-test without restricting to a normal distribution and tests if the differences have a median 0.
4.2.2.2 Estimating the null distribution
Our conjecture that δ(gi, pj, p0l) is a distribution around 0 can be tested on a null dis-
tribution. There are two traditional ways that estimates this null distribution, in which randomized columns or rows of the expression matrix is used to re-compute the statistic over a number of iterations.
The first way assumes the null hypothesis that the subnetwork being tested is irrelevant to distinguishing the two phenotypes. Thus the gene-expression profiles of any pair of patients from the two phenotypes are exchangeable for computing points in the null distribution. In other words, class labels are randomly swapped to create new data inputs from which the null distribution is formed. This method is used by GSEA to
evaluate the significance of the Kolmogorov-Smirnov-like statistic of the pathway when sample size is sufficiently large. This method preserves the full gene-gene correlations, as well as gene-expression values, in each patient. However, when sample size is small there are limited ways in which the class labels can be permuted, resulting in a sparse null distribution. This greatly affects the reliability and the graularity of the p-values.
The second way postulates that any two gene-expression values within the same patient are exchangeable to compute the null distribution. This method creates new data inputs by randomly re-labeling genes. This method is used by GSEA to evaluate the significance of the Kolmogorov-Smirnov-like statistic of the pathway when the dataset has a small sample size, since a sizeable null distribution can be generated this way. However, this postulate is based on the assumption that the genes’ expression are independent of each other, ignoring the correlation between genes. In other words, this method actually tests if the genes in the pathway behave no differently from a random set of genes. But the genes in any pathway are coordinated by nature, whereas a random set of genes is not. Hence this null hypothesis is not appropriate. So it has a tendency of being rejected, producing false positives.
We rely instead on a third way to produce the null distribution for our test. It postu- lates that randomized gene-expression profiles that preserve the gene-gene correlation structure in the original dataset are exchangeable with it. This postulate is consistent with the assumption that genes in any pathway are coordinated and gene expressions are governed by biological pathways. Due to exchangeability that follows from the pos- tulate, it is statistically sound to use correlation-preserving randomized gene-expression profiles to obtain a null distribution of the test statistic. As mentioned in chapter 2, array rotation (Dorum et al., 2009) is one of the known techniques for producing a large number of these correlation-preserving randomized gene-expression profiles. We use this technique to produce statistically-valid p-values for our test statistic.