Relationship with sequence features - Statistical modeling of chromatin structure

2.4 Statistical modeling of chromatin structure

2.4.3 Relationship with sequence features

The relationship between DNA flexibility, nucleosome positioning and the sequence was more quantitatively investigated by Lee et-al(2007) Lee et al. (2007). It was found that the CTG trinucleotide correlates well with nucleosome occupancy while poly A-dT correlates negatively. A lasso model was constructed to bring out a linear relationship between the tip, tilt, twist angles, the sequence composition, and the transcription factor binding sites. Propeller twist capacity emerged as the most significant variable, (

dinucleotides having highly negative propeller twist angles were found to be more rigid than those having lower negative values) The AAAA tetranucleotide was found to contribute to the rigidity of DNA conformation.

Nucleosomes were mostly found in coding regions and centromeres whereas nucleosome depleted regions corresponded with intergenic regions, most of which were promoter regions. In fact nucleosome free regions marked the boundary of transcription start sites, in that they were found just upstream of these start sites. A ladder of well positioned nucleosomes was found start from the beginning of transcription start sites. The nucleosome depleted regions also corresponded well with the occurence of transcription factor binding sites, which are clustered 100 bp upstream of the transcription start sites.126 known transcription factor binding sites and their position specific weight matrices were taken, and Wilcoxon mann whitney pvalues were computed for nucleosome occupancy. A strong relationship between nucleosome occupancy and the occurence of TFBs was observed. Four main clusters of coding regions were identified based on their nucleosome occupancy signature. Interestingly, these regions corresponded to four different gene regulatory functions.

(2007)Peckham et al. (2007) and Yassour-et-al(2008)Yassour et al. (2008). Yassour et al extended the HMM technique employed by Yuan-et-al. They implemented a hidden markov model with multiple states, in order to account for unstable nucleosome position and to adjust for global trends. Peckham-et-al employed a support vector machine for classification of nucleosomal and non nucleosomal states. They tested the efficiency of classification by using ROC curve. Their results were similar to that of studies, where G-C sequence features were found to be positively correlated with nucleosome formation, and A-T appeared to have inhibitory influence.

In 2008 Yuan and Liu Yuan and Liu (2008) used wavelet decomposition to represent the sequence signal and then used them as covariates in a logistic model. Based on the predictions from this model, they came up with the N -score, a statistical measure that quantifies the the tendency of a sequence to be occupied by a nucleosome. Below, we give a short sketch of the derivation of the N score.

Construction of the N score: 199 Nucleosomal DNA sequences were determined

experimentally. They were then aligned by both forward and reverse strand as in Segal et al. Each nucleosomal DNA was about 145-153 base pair long. 296 NFR sequences were determined using a tiling micro array. Each NFR sequence had a 100 bp long linker region. From both the nucleosomal and NFR sequences, the central 131 bp was retained for statistical analysis. The linker sequences which were shorter than 131 bp was

expanded symmetrically on both sides. The reverse strands were also added to the data set.

Next, for each sequence and each dinucleotide d, a 0-1 vector of length 130 is set up. The jth _{component of the vector denotes the indicator of whether the dinucleotide is in} _jth and the (j+ 1)th position. From this, a vector of length 128 is computed by averaging over the entire sequence. Theith element in this vector is denoted by fs,d(i/128)

The crucial step in the construction of N score comes in the wavelet transform of each of these dinucleotide frequency signals. Each of these signals is written as a linear

combination of wavelets is the following manner.

fs,d(i/128) = X

whereψj_k is a Haar wavelet function defined as follows:

ψj_k(x) = 1 0< x <1/2 (2.4.7) =−1 1/2< x <1 (2.4.8)

= 0 otherwise (2.4.9)

The wavelet transform coefficients cj_k(s, d) can be obtained as in the coefficients for an orthonormal basis. (Note thatψ_kj(x) construct an orthonormal basis for a 0-1 valued function)

cj_k(s, d) =X i

fs,d(i/128)ψ_kj(i/128) The energy of a signal at level j is defined as

E =X k

[cj_k]2

It represents the variance of the signal at 27−j level. For 8 dinucleotides, we have 8 energy values for a given signal.

A logistic regression is now performed where the response is the indicator function of being a nucleosomal sequence, and the covariates are the 8 energy signals.

log P(s)

1−P(s) =β0+ X

βlxl(s)

wherexl(s) is the energy of the lth dinucleotide in sequence s. A stepwise procedure is then implemented to retain only the important covariates. The N score for each sequence is defined as the predicted logit from this model.

The N score model was validated by a two fold cross validation.The linker and nucleosomal sequences were divided into two groups, such that there are equal

proportions of both types in each group. While one dataset was used to train, the other was used for testing. This was repeated five times. A ROC curve was obtained and the area under the ROC curve, i.e, the ROC score was used to tests its performance against the other models. One surprising result was that N scores derived from yeast data

matched quite well with the human genome data indicating that sequence specificity of nucleosomes may be conserved across eukaryotes.

Segal’s model with a ROC score of .67 performed poorly against the N-score model. But the poor performance of the model was due to the fact that it did not use a

discriminative approach for nucleosome prediction, as discussed earlier. It relied only on the properties of the nucleosomal sequence data for prediction. The ROC score for the support vector machine model was .82, insignificantly lower than the N score model. When the linker sequence data was incorporated into the Segal model, the ROC score shot up to .81, very close to that of the N score model. This showed that the superiority of the N score model was not derived from its use of wavelets, but in the fact it was the first nucleosomal prediction algorithm that use discriminative approach.

Poly Da-Dt tracks have been found to be correlated with the occurence of nucleosome free regions. The N scores helped to verify the result once again. Relationship between the N score at poly Da-Dt loci and the length of the ploy Da-Dt run was investigated. A negative correlation of .15 ( pvalue less than .001) was obtained, confirming the

experimental evidence, that poly Da-Dt tracks are depleted of nucleosomes. From the results of the logistic regression, it was seen that TT/AA/TA were the most important predictors, followed by TA/AC/GT, while GC had moderate predictive power. These results match with the results obtained from earlier studies. One interesting result was that the important predictors were related with nucleosome depletion rather than

nucleosome formation. This suggested that the main role of the sequence information lay in determining boundaries of the nucleosome free regions.

Distribution of N score was compared at the transcription factor binding sites, unbound motif sites, and the genomic background. The average N score over the transcription factor binding sites is -1.03 which is less than that in the unbound motif sites(.70) and the genomic background (.77).

An interesting property of nucleosome occupancy investigated by Struhl et al Sekinger et al. (2005), was that deletion of sequence elements from promoter region increased DNA accessibility. In order to test this hypotheses, a computational experiment was

performed. From every set of promoter regions, DNA sequences were progressively removed, starting from 10 bp. At each step 10 was further removed, and this was

continued till 200 bp. The change of n score was calculated for every deletion. A positive correlation of .48 (pvalue < .001) was obtained between the length of deleted sites and the N score. This result had motivated us to incorporate sequence features into our model for nucleosome prediction.

The nucleosome models discussed so far, however,leave sufficient room for significant improvements . In the following section, we discuss them and also suggest the approach that we took in this regard.

In document Bayesian model based approaches in the analysis of chromatin structure and motif discovery (Page 34-38)