Supplementary Material for Chapter 3 - Sensitivity And Specificity Of Gene Set Analysis

Table B.1: Kruskal-Wallis test p-values for the difference in reproducibility of each gene set analysis method across sample sizes, for the three original datasets. The difference is statistically significant for every method and dataset except PADOG on GSE10334 (p = 7.18e-01).

Method GSE53757 GSE13355 GSE10334
FRY 6.28e-13 2.99e-26 2.60e-13
GSEA-S 1.07e-11 1.45e-15 4.72e-05
GSEA-G 5.99e-12 6.72e-19 3.58e-13
ORA 5.03e-18 2.39e-19 1.67e-14
Camera 1.81e-19 2.11e-20 9.74e-05
ssGSEA 4.71e-25 8.70e-26 1.87e-26
PAGE 1.81e-16 5.26e-20 1.50e-10
GSVA 5.30e-20 7.17e-27 4.21e-07
PLAGE 2.10e-05 4.88e-06 4.89e-03
ROAST 1.37e-14 1.80e-26 1.10e-13
GAGE 2.34e-28 3.37e-28 4.51e-27
GlobalTest 6.73e-21 7.29e-25 2.58e-16
PADOG 7.10e-06 1.79e-17 7.18e-01
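The kind of test summarised in Table B.1 can be illustrated with a minimal, standard-library sketch of the Kruskal-Wallis test. It assumes that reproducibility is measured by one overlap score per replicate and that the scores are grouped by sample size. The function names (`kruskal_wallis`, `chi2_sf`) and the example scores are hypothetical and are not taken from the thesis. The p-value uses the usual chi-square approximation with k − 1 degrees of freedom.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # closed form for df = 1
    # Recurrence Q(a + 1, x) = Q(a, x) + x^a * exp(-x) / Gamma(a + 1)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of groups, each a list of observations."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    # Assign every distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)
    # Standard correction for tied observations.
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1.0 - tie_sum / (n_total ** 3 - n_total)
    return h, chi2_sf(h, len(groups) - 1)


# Made-up overlap scores for one method, keyed by sample size per group.
overlap_by_sample_size = {
    3: [0.02, 0.05, 0.04, 0.08, 0.03],
    10: [0.21, 0.30, 0.26, 0.35, 0.28],
    20: [0.55, 0.61, 0.58, 0.66, 0.60],
}
h_stat, p_value = kruskal_wallis(list(overlap_by_sample_size.values()))
print(f"H = {h_stat:.2f}, p = {p_value:.2e}")
```

A small p-value indicates that the overlap scores differ between at least two sample sizes. It does not say which sample sizes differ or in which direction.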

Figure B.1: MDS plot showing the relation between samples in dataset GSE53757. Each sample is represented as a point on the plot. The control samples are coloured in dark red and the case samples are coloured in blue.

Figure B.2: MDS plot showing the relation between samples in dataset GSE10334. Each sample is represented as a point on the plot. The control samples are coloured in dark red and the case samples are coloured in blue.

Figure B.3: MDS plot showing the relation between samples in dataset GSE13355. Each sample is represented as a point on the plot. The control samples are coloured in dark red and the case samples are coloured in blue.

Figure B.4: Pine plots for dataset GSE53757 showing reproducibility of the results from ROAST (left) and FRY (right) across sample sizes. Reproducibility is quantified by overlap score (Equation 4.2). Each layer of the pine plot illustrates the overlap score of the results of a method for 10 replicate datasets with the same sample size. From top to bottom, the pine plot shows replicates with sample size 2 × 20, 2 × 15, 2 × 10, 2 × 5, and 2 × 3. The overlap score ranges from 0 (blue) to 1 (red), with yellow marking the midpoint (overlap of 0.5).

Figure B.5: Pine plots for dataset GSE53757 showing reproducibility of the results from Camera (left) and PADOG (right) across sample sizes. Reproducibility is quantified by overlap score (Equation 4.2).

Figure B.6: Pine plots for dataset GSE53757 showing reproducibility of the results from PAGE (left) and GSVA (right) across sample sizes. Reproducibility is quantified by overlap score (Equation 4.2).

Figure B.7: Pine plots for dataset GSE53757 showing reproducibility of the results from PLAGE (left) and GlobalTest (right) across sample sizes. Reproducibility is quantified by overlap score (Equation 4.2). Each layer of the pine plot illustrates the overlap score of the results of a method for 10 replicate datasets with the same sample size. From top to bottom, the pine plot shows replicates with sample size 2 × 20, 2 × 15, 2 × 10, 2 × 5, and 2 × 3. The overlap score ranges from 0 (blue) to 1 (red), with yellow marking the midpoint (overlap of 0.5).

Figure B.8: Pine plots for dataset GSE53757 showing reproducibility of the results from ssGSEA across sample sizes. Reproducibility is quantified by overlap score (Equation 4.2). Each layer of the pine plot illustrates the overlap score of the results of a method for 10 replicate datasets with the same sample size. From top to bottom, the pine plot shows replicates with sample size 2 × 20, 2 × 15, 2 × 10, 2 × 5, and 2 × 3. The overlap score ranges from 0 (blue) to 1 (red), with yellow marking the midpoint (overlap of 0.5).

Figure B.9: Box plots showing the distribution of overlap scores resulting from gene set analysis using FRY when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.10: Box plots showing the distribution of overlap scores resulting from gene set analysis using Camera when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.11: Box plots showing the distribution of overlap scores resulting from gene set analysis using ssGSEA when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.12: Box plots showing the distribution of overlap scores resulting from gene set analysis using PAGE when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.13: Box plots showing the distribution of overlap scores resulting from gene set analysis using GSVA when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.14: Box plots showing the distribution of overlap scores resulting from gene set analysis using PLAGE when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.15: Box plots showing the distribution of overlap scores resulting from gene set analysis using ROAST when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.16: Box plots showing the distribution of overlap scores resulting from gene set analysis using GlobalTest when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.17: Box plots showing the distribution of overlap scores resulting from gene set analysis using PADOG when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.18: Box plots showing the distribution of overlap scores resulting from gene set analysis using GSEA-G when using the original dataset GSE53757 for generating replicate datasets. The panel on the left shows the overlap scores from replicate datasets, while that on the right depicts the overlap scores of each replicate dataset and the whole dataset. See Figure 3.3 caption for more information.

Figure B.19: Kendall’s coefficient of concordance for each method under study when using the original dataset GSE10334 for generating replicate datasets. The x-axis shows the sample size. The y-axis shows concordance coefficients of the results of gene set analysis of 10 replicate datasets of the same size.

Figure B.20: Kendall’s coefficient of concordance for each method under study when using the original dataset GSE13355 for generating replicate datasets. The x-axis shows the sample size. The y-axis shows concordance coefficients of the results of gene set analysis of 10 replicate datasets of the same size.
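The concordance coefficient plotted in Figures B.19 and B.20 can be sketched as follows. This is a minimal illustration, assuming that each replicate's result is converted into a ranking of the same gene sets (for example by p-value) and that no ranks are tied. The helper names `to_ranks` and `kendalls_w` and the example p-values are hypothetical. The sketch uses the textbook formula W = 12S / (m²(n³ − n)), where m is the number of replicates, n the number of gene sets, and S the sum of squared deviations of the rank sums from their mean.

```python
def to_ranks(scores):
    """Rank scores in ascending order (1 = smallest); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def kendalls_w(rankings):
    """Kendall's W for m rankings of the same n items (no tie correction)."""
    m = len(rankings)
    n = len(rankings[0])
    # Total rank each item receives across all rankings.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2.0
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


# Made-up p-values for four gene sets in three replicate datasets.
replicate_p_values = [
    [0.001, 0.020, 0.300, 0.700],
    [0.004, 0.010, 0.500, 0.200],
    [0.002, 0.050, 0.400, 0.900],
]
w = kendalls_w([to_ranks(p) for p in replicate_p_values])
print(f"Kendall's W = {w:.3f}")
```

W equals 1 when all replicates rank the gene sets identically and approaches 0 when the rankings show no agreement.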

Figure B.21: The number of gene sets predicted as differentially enriched for each method under study when using the original dataset GSE10334 for generating replicate datasets. The x-axis shows the sample size per group. The y-axis shows the average number of gene sets predicted as differentially enriched across 10 replicate datasets of the same size. The red line parallel to the x-axis shows the size of the gene set database being used, i.e. the maximum possible number of gene sets that could be predicted as being differentially enriched.

Figure B.22: The number of gene sets predicted as differentially enriched for each method under study when using the original dataset GSE13355 for generating replicate datasets. The x-axis shows the sample size per group. The y-axis shows the average number of gene sets predicted as differentially enriched across 10 replicate datasets of the same size. The red line parallel to the x-axis shows the size of the gene set database being used, i.e. the maximum possible number of gene sets that could be predicted as being differentially enriched.

Table B.2: Average (µ) and standard deviation (σ) of the number of differentially enriched gene sets reported by each method for control-control experiment when using the original dataset GSE53757 for generating replicate datasets. Since both phenotypes have been randomly chosen from control samples of a real dataset (GSE53757), no differentially enriched gene set is expected. The reported gene sets are considered as false positives. Methods with a large number of reported gene sets suffer from a lack of specificity.

Sample size per group 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
FRY µ 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
FRY σ 0.3 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
GSEA-S µ 0.0 0.0 0.0 0.0 17.2 6.5 10.2 13.1 16.1 7.0 14.0 10.4 16.4 11.1 9.5 10.3 11.3 9.9
GSEA-S σ 0.0 0.0 0.0 0.0 18.5 4.2 8.8 12.8 14.9 4.4 16.8 9.6 15.9 6.3 7.8 7.4 7.4 11.9
GSEA-G µ 621.8 578.1 596.6 555.8 609.1 559.5 677.8 792.0 884.5 789.0 720.6 511.8 639.9 556.5 565.6 758.0 707.5 425.5
GSEA-G σ 463.4 580.2 454.5 450.4 441.4 476.2 420.7 472.5 526.7 528.7 442.4 348.0 531.0 471.6 441.3 655.6 326.1 338.3
ORA µ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ORA σ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Camera µ 0.0 0.0 0.8 0.2 0.0 0.3 38.2 28.3 54.1 18.5 67.0 38.8 102.2 55.5 62.5 72.9 55.3 129.3
Camera σ 0.0 0.0 2.4 0.4 0.0 0.5 107.6 59.9 83.2 18.5 80.7 29.9 98.2 64.6 61.4 55.1 34.8 78.4
ssGSEA µ 5157.7 5174.8 5185.0 5220.1 5219.1 5231.9 5233.5 5249.3 5249.2 5256.5 5251.9 5261.5 5264.1 5271.0 5278.2 5276.9 5282.5 5283.0
ssGSEA σ 1712.0 1717.0 1720.3 1731.6 1731.1 1735.4 1735.9 1741.1 1741.0 1743.4 1741.9 1745.0 1745.9 1748.3 1750.6 1750.2 1752.1 1752.2
PAGE µ 1189.5 1071.5 1082.9 1095.2 1155.4 1206.6 1246.0 1343.0 1169.7 1225.2 1221.6 1020.3 1228.2 1139.4 1115.6 1128.4 1208.5 1055.5
PAGE σ 426.8 509.0 472.1 424.1 417.4 456.1 473.4 477.4 530.3 516.4 467.7 409.7 538.5 412.9 546.4 472.9 471.7 473.0
GSVA µ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
GSVA σ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
PLAGE µ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
PLAGE σ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ROAST µ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ROAST σ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
GAGE µ 941.9 1291.3 1472.2 1718.0 1979.4 2073.3 2214.6 2317.6 2373.8 2452.0 2540.7 2553.8 2632.5 2647.8 2683.5 2741.0 2746.3 2779.1
GAGE σ 364.0 488.9 510.9 592.4 665.2 698.2 740.2 772.1 792.0 824.3 847.1 848.3 874.6 880.6 892.6 910.4 911.9 923.5
GlobalTest µ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
GlobalTest σ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
PADOG µ 264.8 78.6 24.7 6.3 7.3 6.2 7.1 5.0 4.1 4.4 6.5 6.3 6.6 7.6 4.2 3.8 6.7 5.7
PADOG σ 106.6 37.5 12.1 4.9 4.5 6.0 7.7 3.3 4.9 4.3 5.6 6.6 5.3 5.7 4.5 5.2 4.6 4.8
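As a small illustration of how one µ/σ pair in Table B.2 could arise, the sketch below summarises a hypothetical set of false-positive counts from 10 control-control replicates at a single sample size. The counts are invented for illustration. The number of replicates and the use of the sample standard deviation (n − 1 denominator) are assumptions, since the table does not state which estimator was used.

```python
import statistics

# Hypothetical counts of gene sets reported by one method in 10
# control-control replicates; every reported gene set is a false positive.
false_positive_counts = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

mean_count = statistics.mean(false_positive_counts)
# Sample standard deviation (n - 1 denominator); this choice is an assumption.
sd_count = statistics.stdev(false_positive_counts)

print(f"mu = {mean_count:.1f}, sigma = {sd_count:.1f}")
```

A single false positive in one of ten replicates yields µ = 0.1 and σ ≈ 0.3 after rounding to one decimal place, which matches the pattern of the non-zero FRY entries in the table.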

Appendix C
