Statistical analysis - – MATERIALS AND METHODS

CHAPTER 3 – MATERIALS AND METHODS

3.5. Statistical analysis

Descriptive statistics were performed to characterize the cancer cohorts. Inferential statistical analysis was executed to capture populational significant results regarding gene expression and DNA methylation, discriminated by solid tissue normal and stage I primary tumor. Before any comparative analyses (normal vs tumor), was performed a normality test to verify normal distribution. Results were considered statistically significant when p-value< 0.05, assuming the False Discovery Rate (FDR) correction < 0.05.

3.5.1. Shapiro–Wilk normality test

Shapiro-Wilk test was used to test the normal distribution. This procedure is important since the interpretation and statistical inference procedures must to be sustained. Shapiro–Wilk normality test represents the most powerful test in order to attest distribution behaviour considering sample size effect (Razali & Wah 2011). Parametric or non-parametric procedures were conducted based on the Shapiro-Wilk test results. A p-value<0,05 lidded to the rejection

Figure 3.5 - Duplicate cases function. This function identifies duplicate samples IDs and collects those data frames by performing median for be transformed into single data frame.

39 of the null hypothesis, a sample came from a normally distributed population (Shapiro & Wilk 1965), and therefore we opted for a non-parametric tests version.

This test was performed using shapiro.test function available in stats R package.

3.5.2. Levene test

Groups variance is a very important assumption in statistical procedures, especially when we are testing, equal means by groups. Therefore, we performed Levene's test for homogeneity of variance across two samples (Levene 1960). The Levene's test results determined if we opted for Welch test (unequal variances t-test), which is more reliable in this context, or for the Student´s t-test (equal variances t-test).

This test was performed using leveneTest function available in car R package.

3.5.3. Unpaired two-sample statistical tests

Wilcoxon Rank Sum test (Wilcoxon 1945) or Mann Whitney U test, a non-parametric alternative to Student´s t-test was used to compare two unpaired samples. This test takes into account the ranking differences between the two samples, and tests if the distributions of both populations are equal, or if they were independently selected from populations with same distribution.

Student´s t-test is a parametric test used for testing equal populational means, in the presence of independent samples normally distributed, and assuming similar variance of groups. For unequal variances of groups we performed the Welch test (Welch 1947).

These tests were performed using wilcox.test and t.test function available in stats R package.

3.5.4. Correction of multiple testing

Multiple comparisons increase the probability of taking false positives. An approach to controlling the false discovery rate (FDR) was proposed by Benjamini & Hochberg. This approach takes in account the order of p-values, considering the minimum accumulative to

40 verify the criteria _{ } q i m

m i p_i  , 1,...,     

 (Benjamini & Hochberg 1995). This condition provides a new set of p-values, which compared to the significance associated with the test and verified the previous criteria can be actualized.

This correction was performed using p.adjust function available in stats R package.

3.5.5. Pearson correlation test

Pearson´s ρ test was performed aiming to find significant correlations between methylation in gene expression patterns, within primary tumor samples. Correlations between gene expression and DNA methylation of the respective CpG site was tested, where correlation coefficient, ranges from -1 to 1.

This test was performed using cor.test function available in stats R package.

3.5.6. Receiver operating characteristic (ROC curves)

The receiver operating characteristic (ROC) curve is a graphical representation of the pairs sensitivity, true positive fraction (TPR), and 1-specificity, false positive fraction (FPR). All cutoffs result from the coordinates that represents a compromise between sensitivity and specificity. This measure the quality of a diagnostic biomarker, representing the probability of discriminate stage I primary tumor from normal samples (Fawcett 2006; Robin et al. 2011). Moreover, the area under the curve (AUC) is used as an accuracy index (Fawcett 2006).

AUC is used as a quality measure of the curve and it is calculated through the trapezoid rule (Robin et al. 2011). We defined an AUC ≥ 0.8, a sensibility ≥ 0.6 and a specificity ≥ 0.6 to obtain the potential new biomarkers for further clinical applications (Greiner et al. 2000). ROC curves are represented in Figure 3.6A for all genes differentially expressed which presents CpG probes differentially methylated in lung cancer and the Figure 3.6B represents ROC curves considering the mentioned cutoffs criteria.

This analysis was performed using roc and ggroc functions available in pROC R package (Robin et al. 2018).

3.5.7. Multiple Linear Regression Model

Multiple linear regression (MLR) was used to analyse the relationships between a dependent variable (gene expression) and multiple independent variables (DNA methylation). The MLR was useful to verify which CpG probes are statistically significant on the variation of gene expression. The MLR model equation is given by,

Figure 3.6 – ROC curve analysis of cDMGs of lung cancer. (A) All ROC curves were represented at various colors and (B) the selection of area under the curve (AUC) equal or greater than 0.8. Sensitivity represent the probability of true positives and 1-specificity represent the probability of 1-P (false positives), true negatives.

Y



₀



₁

X

₁



₂

X

₂



₃

X

₃

....

X

e

(1) where



, i

1,...,k

are the weight that indicates the relative contribution for each unit measure of Xi in dependent variable variation.

X

, i

1,...,k

are the original independent variables (Hair

et al. 1999).

This analysis was performed using the lm function available in stats R package

In document Whole-genome analysis of DNA methylation across cancer types reveals specific patterns in early stages (Page 66-70)