Markov models for local environments

5.2.2. Markov models for local environments

The aim of testing for SNP sensitivity is to identify T1D-SNPs in binding sites that cause a significant change in the binding signal of their local environment (i.e. the binding motif in which they occur). Testing for SNP sensitivity starts with fitting a Markov model [15] (Fink, 2007) to the local environment of a SNP. The local environment is a selected part of the regulatory region surrounding a regulatory SNP. It is made up of 601 base pairs (bps): the SNP position itself plus 300 bps of flanking sequence on each side. For each TFBS-SNP, this window should typically encompass the binding site that overlaps the SNP (Figure 23).

[15] An algorithm (Markov algorithm 1), implemented in Python 2.7, establishes the Markov order of nucleotide dependency of each 601 bp local environment, for both alleles of each of the 92 TFBS-SNPs. A full description of the method can be found in Appendix C.
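The construction of the two allele-specific local environments can be sketched as follows. This is a minimal illustration in Python 3, not the thesis code (which is described in Appendix C); the function name `local_environment`, its parameters and the toy sequence are all hypothetical.

```python
def local_environment(region_seq, snp_index, ref, alt, flank=300):
    """Return (reference, mutant) windows of 2 * flank + 1 bases centred on the SNP.

    region_seq and snp_index (0-based) are hypothetical inputs for illustration.
    """
    start, end = snp_index - flank, snp_index + flank + 1
    if start < 0 or end > len(region_seq):
        raise ValueError("window extends beyond the available sequence")
    if region_seq[snp_index] != ref:
        raise ValueError("sequence does not carry the reference allele at the SNP")
    reference = region_seq[start:end]
    # Swap only the central base to obtain the mutant-allele window.
    mutant = reference[:flank] + alt + reference[flank + 1:]
    return reference, mutant


# Made-up 1001-base sequence with a T>A SNP at index 500.
toy = "ACGT" * 125 + "T" + "ACGT" * 125
ref_env, alt_env = local_environment(toy, 500, "T", "A")
print(len(ref_env), ref_env[300], alt_env[300])
```

The two windows differ only at the central position, so any difference in the fitted Markov order between them is attributable to the SNP allele.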

SNP counts    Associated SNPs    Non-Associated SNPs
TFBS          0                  92
REG           22                 10085
NON-REG       57                 250125
Total         79                 260302

Figure 23. Local environment and neighbourhood of TFBS-SNPs

The fitted Markov model predicts the sequential characteristics of the local environment of the SNP. The order 𝒎 of the fitted model is an estimate of the degree of sequential dependency of DNA nucleotides in the region (Edwards et al., 2009). Model fitting is done twice per regulatory sequence: first with the sequence containing the reference allele of the SNP, and again with the sequence containing the mutant allele. Establishing the Markov models separately ensures that the expected probabilities used later for signal representation are computed properly. Ideally, expectancies should be calculated on the basis of the established Markov order of the sequence, and the order of a given sequence may differ between the two alleles of the SNP. A detailed explanation of how the regulatory sequences are fitted with the Markov model, and of the algorithm design, is given in Appendix C. The Ensembl BioMart tool was used to select the local environment of each SNP such that the 300 flanking nucleotides remain within the regulatory module in which the SNP occurs. The reason for this is that the Markov order could also differ within a sequence between its regulatory and non-regulatory parts.
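To make the idea of estimating a Markov order concrete, the sketch below selects the order with the lowest Bayesian information criterion (BIC). This is an assumption made for illustration: BIC is one common order-selection criterion, and the procedure actually used in the thesis (Appendix C) may differ. In particular, this sketch always returns an order and has no "not established" outcome. The function names are hypothetical and the code is Python 3.

```python
import math
from collections import Counter


def log_likelihood(seq, order, start):
    """Maximised log-likelihood of seq[start:] under a Markov chain of the given order."""
    pair_counts = Counter()     # (context, next base) -> count
    context_counts = Counter()  # context -> count
    for i in range(start, len(seq)):
        context = seq[i - order:i]  # empty string when order == 0
        pair_counts[(context, seq[i])] += 1
        context_counts[context] += 1
    return sum(n * math.log(n / context_counts[ctx])
               for (ctx, _), n in pair_counts.items())


def estimate_markov_order(seq, max_order=2):
    """Return the order in 0..max_order with the lowest BIC (illustrative criterion)."""
    n = len(seq) - max_order  # the same positions are scored for every order
    best_order, best_bic = None, float("inf")
    for order in range(max_order + 1):
        free_params = 3 * 4 ** order  # 3 free transition probabilities per context
        bic = -2 * log_likelihood(seq, order, max_order) + free_params * math.log(n)
        if bic < best_bic:
            best_order, best_bic = order, bic
    return best_order


# Made-up 601-base sequences with a known dependency structure.
print(estimate_markov_order("ACGT" * 150 + "A"))  # each base fixes the next one
print(estimate_markov_order("AACC" * 150 + "A"))  # two bases are needed to fix the next
```

Running the same function on the reference and mutant windows of a SNP would give the two allele-specific orders discussed above.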

Findings: Markov models for local environments

Markov models could only be established for orders 𝒎 = 0, 1 or 2, and only for fewer than half of the sequences (Table 15). This is due either to the length of the sequence (601 bps) or to strong non-stationarity. In the first case, it has been shown that the number of nucleotides used to construct a Markov model limits the order that can be fitted (Thijs et al., 2001). Exponentially longer sequences are needed to build the transition matrices required to fit models of higher Markov order. In the second case, it is unlikely that DNA sequences are simple, stationary, low-order Markov chains. A stationary series is one whose statistical properties are constant over time. Such properties include the mean, variance, autocorrelation and so on. In a stationary series, there is no change or relationship between adjacent time periods, and the series may be referred to as time homogeneous or memoryless. Conversely, in a non-stationary process, there is a difference or relationship in properties between adjacent periods over time (Chatfield, 2003; Priestly, 1981).
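The sequence-length limitation can be illustrated with simple arithmetic. An order-𝒎 model over four nucleotides has 4^𝒎 contexts and 3 · 4^𝒎 free transition probabilities, while a 601 bp sequence supplies only 601 − 𝒎 observed transitions. The short Python sketch below tabulates this; the helper name `transition_budget` is hypothetical.

```python
SEQ_LEN = 601  # length of one local environment, as stated in the text


def transition_budget(seq_len, order):
    """Return (contexts, free parameters, mean observed transitions per context)."""
    contexts = 4 ** order          # distinct preceding k-mers of length `order`
    free_params = 3 * contexts     # 3 free probabilities per context (they sum to 1)
    transitions = seq_len - order  # transitions observed in a single sequence
    return contexts, free_params, transitions / contexts


for order in range(6):
    contexts, free_params, per_context = transition_budget(SEQ_LEN, order)
    print(order, contexts, free_params, round(per_context, 1))
```

For a single 601 bp sequence, the average number of observations per context is about 150 at order 1 but fewer than 10 at order 3, which is in line with models being established only for 𝒎 ≤ 2.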

Table 15. Number of established Markov models for three types of SNPs. The sequences for which Markov models could not be established are assumed to have Markov orders ≥ 3.

In DNA, genomic signals are likely non-stationary because there is a statistical difference between adjacent coding and non-coding sequences. For example, a three-base periodicity (second-order dependency) of nucleotides has been established for coding regions (Howe et al., 2013). Regulatory regions in non-coding DNA also contain distinct motifs that deviate from zero- and first-order dependency (Howe et al., 2013; Abnizova and Gilks, 2006; Thijs et al., 2001). This makes them typically non-stationary and gives their sequence a fractal nature (Abnizova et al., 2007). These notions are supported by the data presented in Table 15, which indicate a difficulty in establishing models for many regulatory and non-regulatory sequences. Interestingly, though, a chi-square test indicates a significant association between Markov order and genic region (χ² = 29, df = 6, p < 0.001). Chi-square values are also significant if the test is restricted to two categories of genic regions: (i) NON-REG and ALL-REG (TFBS + REG, i.e. all regulatory sequences, including those with SNPs in binding sites) (χ² = 13.35, df = 3, p < 0.004); (ii) TFBS and REG (χ² = 15.44, df = 3, p < 0.0025). Standardized residuals point to an over-representation of 𝒎 = 0 models for NON-REG regions, 𝒎 = 1 models for ALL-REG regions and more 𝒎 = 2 models than expected under independence for TFBS regions (Table 16). In addition, a regulatory region may overlap with another type of genic region, for instance an exon, which may lead to complex dependencies.

Table 15 (data). N = number of sequences; % = percentage of the column total.

Markov model       TFBS-SNPs         REG-SNPs          NON-REG-SNPs
                   N      %          N      %          N      %
0                  13     14.13      49     12.53      41     10.28
1                  5      5.43       68     17.39      86     21.55
2                  3      3.26       42     10.74      18     4.51
Not Established    71     77.17      232    59.34      254    63.66
Total              92                391               399

Table 16. Standardized residuals after chi-square tests for associations between genic regions and Markov orders fitted to the data of Table 15. m = Markov order; G-R = Genic Region.

m \ G-R    TFBS      REG       NON-REG
0           0.688     0.494    -0.820
1          -2.845    -0.296     1.659
2          -1.393     2.663    -1.967
higher      1.692    -0.950     0.128

m \ G-R    ALL-REG   NON-REG
0           0.745    -0.820
1          -1.508     1.659
2           1.788    -1.967
higher     -0.116     0.128

m \ G-R    TFBS      REG
0           0.346    -0.17
1          -2.39      1.158
2          -1.9       0.923
higher      1.749    -0.85
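The chi-square statistic and residuals can be recomputed from the counts in Table 15. The sketch below is a plain-Python illustration and not the analysis code used in the thesis. It assumes that the "standardized residuals" are Pearson residuals, (observed − expected) / √expected, an assumption that appears consistent with the Table 16 values. The function name `pearson_chi_square` is hypothetical.

```python
# Counts from Table 15: rows are Markov order 0, 1, 2 and "not established";
# columns are TFBS-, REG- and NON-REG-SNP sequences.
TABLE_15 = [
    [13, 49, 41],
    [5, 68, 86],
    [3, 42, 18],
    [71, 232, 254],
]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom, Pearson residuals)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    for row, row_total in zip(table, row_totals):
        res_row = []
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total  # under independence
            res_row.append((observed - expected) / expected ** 0.5)
        residuals.append(res_row)
    # The chi-square statistic is the sum of squared Pearson residuals.
    chi2 = sum(r * r for res_row in residuals for r in res_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, residuals


chi2, dof, residuals = pearson_chi_square(TABLE_15)
print(round(chi2, 2), dof)
for res_row in residuals:
    print([round(r, 3) for r in res_row])
```

Merging the TFBS and REG columns before calling the function gives the ALL-REG versus NON-REG comparison, and dropping the NON-REG column gives the TFBS versus REG comparison.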

It is also important to point out that the total numbers of SNPs analysed in the REG- and NON-REG-SNP categories have each been revised (from 400 to 391 and 399, respectively) because some of the selected SNPs have been described as “failed SNPs” in the recently updated versions of the Ensembl database (v 73, 80). These are SNPs that have not passed a quality control pipeline [16] set by Ensembl for SNPs.

In document The Genomics of Type 1 Diabetes Susceptibility Regions and Effect of Regulatory SNPs (Page 69-72)