
Additional methods

A short introduction to the linear regression framework

Linear models are a broad class of statistical analyses that are at the core of many bioinformatic methods, including differential RNA expression analyses [Ritchie et al., 2015] or genome-wide association studies (GWAS) [Visscher et al., 2017]. An instance of such models is linear regression [Eaton, 2007], a statistical approach that allows modelling of the relationship between:

• A dependent variable $Y$, with observations $y_i \in \mathbb{R}$ and $i \in \{1, \dots, n\}$, where $n$ is the total number of observations (i.e. samples).

• One or more independent variables $X_j$, with observations $x_{ij} \in \mathbb{R}$ and $j \in \{1, \dots, k\}$, where $k$ is the total number of independent variables (a.k.a. covariates). These variables can, for example, indicate whether a specific condition or phenotype is present in a given sample, capture the effect of a continuous variable (such as chronological age) or adjust for batch effects, which gives this statistical framework great analytical flexibility [Ritchie et al., 2015].

We can describe the dependent variable $Y$ as a function of the independent variables $X_j$:

$$ y_i = \sum_{j=1}^{k} x_{ij}\beta_j + \varepsilon_i \tag{2.19} $$

where $\beta_j$ are unknown parameters that need to be estimated from the data and $\varepsilon_i$ is the random error. In matrix form:

$$ y = X\beta + \varepsilon \tag{2.20} $$

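where $y \in \mathbb{R}^n$ is the vector $(y_1, \dots, y_n)$, $X \in \mathbb{R}^{n \times k}$ is the $n \times k$ matrix of $x_{ij}$'s, $\beta \in \mathbb{R}^k$ is the vector $(\beta_1, \dots, \beta_k)$ and $\varepsilon \in \mathbb{R}^n$ is the vector $(\varepsilon_1, \dots, \varepsilon_n)$.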

Assuming that $E(\varepsilon) = 0$, $Var(\varepsilon_i) = \sigma^2 > 0$ and $Cov(\varepsilon) = \sigma^2 I_n$ (where $I_n$ is the $n \times n$ identity matrix) and applying the Gauss-Markov theorem [Eaton, 2007], it can be demonstrated that:

$$ \hat{\beta} = (X'X)^{-1}X'y \tag{2.21} $$

where $X'$ is the transpose of $X$ and $\hat{\beta}$ is the least-squares estimator of $\beta$, since it minimises:

$$ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{k} x_{ij}\hat{\beta}_j \Big)^2 \tag{2.22} $$
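As a rough illustration of equations 2.21 and 2.22, the following minimal Python sketch computes the least-squares estimate on a made-up toy dataset. The function names (`ols_estimate`, `residual_sum_of_squares`) and the data are hypothetical and chosen only for this example; NumPy is assumed to be available. The normal equations are solved with `np.linalg.solve` instead of inverting $X'X$ explicitly, which gives the same estimate when $X'X$ is invertible.

```python
import numpy as np


def ols_estimate(X, y):
    """Least-squares estimate: solves (X'X) beta = X'y (equation 2.21)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)


def residual_sum_of_squares(X, y, beta):
    """Sum of squared residuals for a given coefficient vector (equation 2.22)."""
    residuals = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ beta
    return float(residuals @ residuals)


# Toy data (made up): n = 5 samples, k = 2 columns (a column of ones + one covariate).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

beta_hat = ols_estimate(X, y)
print("beta_hat:", beta_hat)
print("RSS at beta_hat:", residual_sum_of_squares(X, y, beta_hat))
```

Any other coefficient vector gives a residual sum of squares at least as large as the one obtained at `beta_hat`, which is what "least-squares estimator" means here.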

It is possible to test whether there is a statistically significant linear association between the dependent variable ($Y$) and one of the independent variables ($X_j$), i.e. to test:

$$ H_0: \beta_j = 0 \quad \text{against} \quad H_A: \beta_j \neq 0 \tag{2.23} $$

where $H_0$ is the null hypothesis and $H_A$ is the alternative hypothesis. A t-statistic ($T$) can be derived after fitting the linear regression model [Sheather, 2009]:

$$ T = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \tag{2.24} $$

where $se(\hat{\beta}_j)$ is the standard error of $\hat{\beta}_j$. When $H_0$ is true, the statistic $T$ follows a Student's t distribution with $n - k$ degrees of freedom, i.e. $T \sim t_{n-k}$. This allows the p-value for the linear association of $Y$ with a given $X_j$ to be calculated.
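The t-statistic in equation 2.24 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this thesis: the function name `t_statistics` and the toy data are hypothetical, and the standard error is computed with the usual estimate $se(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 [(X'X)^{-1}]_{jj}}$, where $\hat{\sigma}^2$ is the residual sum of squares divided by $n - k$. Here $k$ is taken as the number of columns of $X$, so it includes the column of ones when an intercept is fitted.

```python
import numpy as np


def t_statistics(X, y):
    """Return (T, se, dof) for every coefficient of the model y = X beta + eps."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape                      # k counts the columns of X
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y        # least-squares estimate
    residuals = y - X @ beta_hat
    dof = n - k                         # degrees of freedom of the t distribution
    sigma2_hat = residuals @ residuals / dof
    se = np.sqrt(sigma2_hat * np.diag(xtx_inv))
    return beta_hat / se, se, dof


# Toy data (made up): a column of ones plus one covariate.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

T, se, dof = t_statistics(X, y)
print("T:", T, "se:", se, "degrees of freedom:", dof)
```

If SciPy is available, a two-sided p-value could then be obtained from the Student's t survival function, for example `2 * scipy.stats.t.sf(abs(T), dof)`.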

Finally, it is worth mentioning the nomenclature that I use for linear regression models throughout this thesis. For example, the following model fits a linear association between the dependent variable (e.g. the β-value at a specific CpG probe in the array) and 3 covariates (e.g. age, sex and disease status), with an intercept:

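$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2} & x_{n3}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
\tag{2.25}
$$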

where $y_i$ is the β-value at a certain CpG probe for the $i$th sample, $x_{i1}$ is the age for the $i$th sample, $x_{i2}$ is the sex (e.g. 0 for male and 1 for female) for the $i$th sample, $x_{i3}$ is the disease status (e.g. 0 for a healthy individual and 1 for an individual with a disease) for the $i$th sample, $\beta_0$ is the intercept coefficient, $\beta_j$ are the covariate coefficients ($j = 1$ for age, $j = 2$ for sex, $j = 3$ for disease status) and $\varepsilon_i$ is the error for the $i$th sample.
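To make the structure of the matrix in equation 2.25 concrete, the sketch below assembles such a design matrix in Python. The covariate values are made up for illustration, the helper `build_design_matrix` is a hypothetical name, and the 0/1 coding of sex and disease status follows the examples given above.

```python
import numpy as np


def build_design_matrix(*covariates):
    """Stack a leading column of ones (intercept) and one column per covariate."""
    columns = [np.asarray(c, dtype=float) for c in covariates]
    n = columns[0].shape[0]
    return np.column_stack([np.ones(n)] + columns)


# Hypothetical covariates for n = 4 samples (values made up for illustration).
age = np.array([25.0, 40.0, 61.0, 33.0])   # x_i1: chronological age
sex = np.array([0, 1, 1, 0])               # x_i2: 0 = male, 1 = female
disease = np.array([0, 0, 1, 1])           # x_i3: 0 = healthy, 1 = disease

X = build_design_matrix(age, sex, disease)
print(X.shape)  # one row per sample; columns: intercept, age, sex, disease status
print(X)
```

Each row of `X` corresponds to one sample, and the first column of ones is what allows $\beta_0$ to act as the intercept.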

Throughout this thesis, I use the following nomenclature to describe the model above (‘R-style’ nomenclature):

Biological aspects

‘At a fundamental level evolutionary survival is the preservation of a dynamic balance between information, or order, and entropy, or disorder.’

T. B. L. Kirkwood [1977]

Declaration

This chapter is mainly the product of my own work. Additionally, I would like to recognise the contributions of Janet M. Thornton, Wolf Reik and Thomas M. Stubbs (who helped to design the study and interpret the data), Erfan Aref-Eshghi (who ran some of the analyses using my code and provided part of the samples in the dataset), Marc Jan Bonder and Oliver Stegle (who provided statistical input) and Bekim Sadikovic (who provided part of the samples in the dataset). All of them also helped in the revision of the final text. This work has been published in the journal Genome Biology [Martin-Herranz et al., 2019].

3.1 Background

Epigenetic clocks can be understood as a proxy to quantify the changes of the epigenome with age. However, little is known about the molecular mechanisms that determine the rate of the underlying epigenetic ageing clock (see section 1.3.3). Steve Horvath proposed that the multi-tissue epigenetic clock captures the workings of an epigenetic maintenance system [Horvath, 2013a]. Recent GWAS have found several genetic variants associated with epigenetic age acceleration in genes such as TERT (the catalytic subunit of telomerase) [Lu et al., 2018], DHX57 (an ATP-dependent RNA helicase) [Lu et al., 2016] or MLST8 (a subunit of both mTORC1 and mTORC2 complexes) [Lu et al., 2016]. Nevertheless, to my knowledge, no genetic variants in epigenetic modifiers have been found and the molecular nature of this hypothetical system remains unknown to date.

I decided to take a reverse genetics approach and look at the behaviour of the epigenetic clock in patients with developmental disorders, many of which harbour mutations in proteins of the epigenetic machinery [Aref-Eshghi et al., 2018b; Bjornsson, 2015]. I performed an unbiased screen for epigenetic age acceleration and found that Sotos syndrome accelerates epigenetic ageing, potentially revealing a role of H3K36 methylation maintenance in the regulation of the rate of the epigenetic clock.