Methylation of the cytosine residues in eukaryote DNA is thought to act as a mechanism of
gene expression control. In plants, it typically occurs at CpG sites (Finnegan et al., 1998)
but can also occur at CpNpG sites, where N is any nucleotide, and at CpHpH sites, where H
represents adenine, cytosine or thymine (Lukens and Zhan, 2007). In Arabidopsis, methylation
within gene bodies was found to occur mainly at CpG sites, whilst methylation elsewhere and in repetitive regions could occur at CpG, CHG and
CHH sites (Widman et al., 2009).
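The three sequence contexts can be made concrete with a minimal sketch. The function names are hypothetical, and the code assumes the common convention that the classes are mutually exclusive (CG takes priority over CHG, which takes priority over CHH) and that only the given strand is examined.

```python
from collections import Counter
from typing import Optional


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Classify the cytosine at 0-based index i as 'CG', 'CHG' or 'CHH'.

    Returns None if the base is not a cytosine, or if the context cannot
    be determined (too close to the end of the sequence, or an N present).
    """
    seq = seq.upper()
    if seq[i] != "C":
        return None
    following = seq[i + 1 : i + 3]  # up to two downstream bases
    if following[:1] == "G":
        return "CG"
    if len(following) < 2 or "N" in following:
        return None
    if following[1] == "G":
        return "CHG"
    return "CHH"


def context_counts(seq: str) -> Counter:
    """Tally the cytosine contexts found along one strand of seq."""
    contexts = (cytosine_context(seq, i) for i in range(len(seq)))
    return Counter(c for c in contexts if c is not None)


example = "ACGTCAGTCATC"
print(context_counts(example))
```

In the example sequence, the final cytosine has no downstream bases and is therefore left unclassified.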
Cytosine methylation within gene promoter regions is thought to inhibit binding of regulatory proteins and repress transcription; it can also silence the transposable elements (TEs) that would otherwise disrupt DNA sequence by transposition. TE transposition can result in altered gene expression, novel regulatory networks, gene deletions, duplications,
increases in genome size, illegitimate recombination and chromosome
breaks/rearrangements (Cantu et al., 2010). As such, reduced DNA methylation is known to
disrupt normal plant development. Methylation within introns and downstream exons is highly correlated, and such gene body methylation has been associated
with highly expressed genes in some studies (Zhang et al., 2006), while other studies have
found little or no such association. Brenet et al. found that methylation
downstream of the transcriptional start site, i.e. in the first exon region, was strongly linked to gene silencing, even more so than methylation of the upstream promoter region. The effects of gene body methylation therefore remain controversial and largely unresolved. It is clear that the location of methylation within or around a gene is important;
however, the reason for this is as yet poorly understood (Brenet et al., 2011).
In a study by Rabinowicz et al., a small subset of whole genome sequencing data for hexaploid wheat was analyzed, and methylation filtration was then used in an attempt to isolate hypomethylated genic material from hexaploid wheat for sequencing. The isolated, sequenced material was later used in a BLAST search to identify wheat genes. In the whole genome sequencing data, wheat was found to have a large number of gene-like sequences relative to other plants (1597 sequencing reads, ~500 bp in length, 1.44% genes),
while in the enriched data gene enrichment was comparatively low (1548 sequencing reads, ~500 bp in length, 6.78% genes). They suggested that the apparent excess of genes combined with poor enrichment could be due to high levels of methylated pseudogenes (recently amplified and then silenced), which would reduce the number of active genes to a level closer to that expected (Rabinowicz et al., 2005).
Here, the study of methylation patterns in wheat was to be used to test a number of hypotheses: firstly, whether differential methylation exists between the A, B and D genomes; secondly, using two growth temperatures for Chinese Spring, whether temperature is capable of altering the methylation state and whether any such change is genome-specific or genome-independent; and finally, whether this underlying methylation can control both genome-specific and temperature-dependent changes in gene expression.
There are three main methods used in the laboratory for the study of methylation patterns:
-Bisulfite treatment deaminates un-methylated cytosine residues, converting them to uracil. Because methylated residues are not converted, this allows, after PCR and sequencing, effective discrimination of the methylation status at every cytosine residue, making this method the gold standard in methylation studies (Darst et al., 2010).
-Differential enzymatic cleavage uses methylation-sensitive restriction enzymes to fragment genomic DNA that can then be analyzed. Enzymatic cleavage is limited by the number of enzyme recognition sites (New England Biolabs, 2009).
-Affinity based methods use antibodies or proteins that bind to methylated DNA, enriching the methylated DNA in the experimental sample to allow downstream analysis (New England Biolabs, 2009).
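The chemistry behind the first of these methods can be illustrated in silico. The sketch below is a toy model only: the function names are hypothetical, conversion is assumed to be complete and error-free, and a single strand is considered.

```python
def bisulfite_convert(seq: str, methylated_positions: set) -> str:
    """Model bisulfite treatment of one DNA strand.

    Un-methylated cytosines are deaminated to uracil (U); cytosines whose
    0-based index is in methylated_positions are protected and stay as C.
    """
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq.upper())
    )


def pcr_amplify(converted: str) -> str:
    """Model PCR of the converted strand: uracil is copied as thymine."""
    return converted.replace("U", "T")


original = "ACGTCCGA"
converted = bisulfite_convert(original, methylated_positions={1})
amplified = pcr_amplify(converted)
print(converted)  # ACGTUUGA
print(amplified)  # ACGTTTGA
```

After amplification, a C that persists in the sequence marks a methylated cytosine, while a C that now reads as T marks an un-methylated one.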
The clear significance of the impact of methylation on the genome makes it an obvious area for research. To study the general effects and patterns of methylation in hexaploid wheat without encountering the problems previously detailed, which arise from the large size of the wheat genome, a methylation target enrichment array would be the best way forward, i.e. a means of enriching for regions of interest in the genome and studying methylation patterns therein. Sodium bisulfite treatment is an increasingly popular method for epigenetic profiling and, combined with Agilent’s SureSelect Methyl-Seq Target Enrichment System, allows the study of methylation patterns in target regions. The Agilent enrichment system uses 120 bp biotinylated RNA baits in solution to capture user-defined regions based on primary DNA sequence. In this system, a standard DNA fragment library is hybridized to the biotinylated bait probes in solution, and streptavidin beads are used to collect the complexes of probes and bound DNA fragments, from which the enriched DNA
fragment pool is eluted. The enriched DNA fragments are then bisulfite converted, PCR amplified (so that uracil residues in the sample are replaced by thymine), indexed if necessary and finally sequenced using next generation sequencing technology (Illumina recommended) (Agilent Technologies Inc., 2012). Bioinformatic analysis can then be carried out to differentiate methylated cytosines from un-methylated cytosines and to determine their implications. Such methodology opens up the possibility of cost-effective epigenetic profiling in large genomes.
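The final step, differentiating methylated from un-methylated cytosines, amounts to comparing each sequenced read with the unconverted reference. The following is a minimal sketch rather than the pipeline used in this study: the function name is hypothetical, and it assumes ungapped forward-strand alignments supplied as (start, read) pairs with no sequencing errors. In practice a bisulfite-aware aligner would provide the alignments.

```python
from collections import defaultdict


def methylation_levels(reference: str, aligned_reads: list) -> dict:
    """Estimate the methylation level at each reference cytosine.

    aligned_reads is a list of (start, read) tuples, where start is the
    0-based reference position of the first read base. At a reference C,
    a read C counts as methylated and a read T as un-methylated.
    Returns {position: methylated / (methylated + un-methylated)}.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated, un-methylated]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos >= len(reference) or reference[pos] != "C":
                continue
            if base == "C":
                counts[pos][0] += 1
            elif base == "T":
                counts[pos][1] += 1
    return {pos: m / (m + u) for pos, (m, u) in sorted(counts.items())}


reference = "ACGTCCGA"
reads = [(0, "ACGTTTGA"), (0, "ACGTTCGA"), (2, "GTTCGA")]
for pos, level in methylation_levels(reference, reads).items():
    print(pos, round(level, 2))
```

Positions covered by no informative read are simply absent from the result, which avoids reporting a level where there is no evidence.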
The RNA baits for the SureSelect Methyl-Seq Target Enrichment are based on primary DNA sequence. As such, when designing the wheat methylation array, design-space contigs for the wheat gene capture array were adapted for this purpose. This ensured that probe sequences
were unique, non-repetitive, gene-rich and evenly distributed across the wheat genome. In
this project it is demonstrated that an enrichment array can be used to give an overview of methylation patterns across the genic regions of the wheat genome. The array was designed to target a 6 Mb subset of the genic regions of wheat using the 5x Roche 454 genomic DNA wheat sequence generated by Brenchley et al. (the subset was distributed across the contigs previously selected for the gene capture array design-space) (Brenchley et al., 2012).
Modification of gene expression by methylation can be tissue-specific or developmental stage dependent (Wang et al., 2011a). It has been reported that methylation levels can differ between members of the same species, resulting in disease (Langevin and Kelsey, 2013), and can also differ in response to environmental factors or stresses, e.g. temperature (Hashida
et al., 2006) or salt stress in plants (Wang et al., 2011a). Furthermore, allele-specific
methylation has been observed in both animals and plants. Notably, Wei et al. found allele-specific methylation in humans that resulted in allele-specific expression (ASE) of death-associated protein kinase 1 (DAPK1) and predisposition to chronic lymphocytic leukemia (CLL) (Wei et al., 2013).
The potential for differential methylation of homeologous genes in a polyploid species is an important question in this study. Differential methylation has been observed in maize, correlating with differential expression of the maternal and paternal alleles of the genes r and dzr1
(Kermicle, 1978; Chaudhuri and Messing, 1994). In tetraploid cotton, silencing or unequal expression of homeologs was observed, with epigenetic induction implicated; the proportion of genes with only a subset of homeoalleles expressed was predicted to be as high as 25% (Adams
et al., 2003). For hexaploid wheat, the percentage of genes with only a subset of homeoalleles
expressed, i.e. genome-wise differential gene expression, is thought to be 29%; typically, one of the three homeoalleles present is silenced (Bottley et al., 2006; Wang et al., 2011b).
The array was to be used to test a number of hypotheses in wheat. Firstly, Chinese Spring hexaploid wheat DNA could be enriched using the array to see whether differential methylation exists between the A, B and D genomes. A list of naturally occurring homeologous SNP positions within the array bait sequences would allow identification of such differential methylation: these SNPs make it possible to associate a sequencing read with a homeologous SNP allele and ultimately with a particular wheat genome. Secondly, using two growth temperatures for Chinese Spring (12 °C to represent a lower, more ambient temperature for wheat growth in the UK and 27 °C to represent a contrasting high temperature for wheat growth), DNA could be enriched using the array to test whether temperature is capable of altering the methylation state and whether any such change is genome-specific or genome-independent. Finally, with the use of RNAseq datasets for Chinese Spring at the same two growth temperatures (12 °C and 27 °C), gene expression patterns could be identified and correlated with differential methylation, under the hypothesis that this underlying methylation can control both genome-specific and temperature-dependent changes in gene expression. To generate the gene expression data, cDNA was generated, sequenced and analyzed by Mark Quinton-Tulloch by mapping the sequence data to the methylation array design-space. The program BitSeq was used to estimate gene expression levels.
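The idea of assigning reads to a genome through homeologous SNPs can be sketched as follows. This is an illustration under stated assumptions, not the method used in this study: the names and the SNP table layout are hypothetical, reads are assumed to be ungapped forward-strand alignments, and a read T is treated as compatible with a C allele because bisulfite conversion can turn an un-methylated C into T.

```python
from typing import Optional

GENOMES = ("A", "B", "D")


def compatible(read_base: str, allele: str) -> bool:
    """True if read_base could have come from allele after bisulfite conversion."""
    return read_base == allele or (allele == "C" and read_base == "T")


def assign_genome(start: int, read: str, snps: dict) -> Optional[str]:
    """Assign a read to the A, B or D genome, or None if ambiguous.

    snps maps a 0-based reference position to the allele carried by each
    genome, e.g. {3: {"A": "G", "B": "A", "D": "G"}}. Each SNP covered by
    the read narrows the set of genomes the read could have come from.
    """
    candidates = set(GENOMES)
    for pos, alleles in snps.items():
        offset = pos - start
        if 0 <= offset < len(read):
            base = read[offset]
            candidates &= {g for g in GENOMES if compatible(base, alleles[g])}
    return next(iter(candidates)) if len(candidates) == 1 else None


snps = {
    3: {"A": "G", "B": "A", "D": "G"},
    6: {"A": "C", "B": "C", "D": "T"},
}
print(assign_genome(0, "AAAGAACAA", snps))  # A
print(assign_genome(0, "AAAAAATAA", snps))  # B
print(assign_genome(0, "AAAGAATAA", snps))  # None
```

The third read is left unassigned because a T at a C/T SNP cannot distinguish a true T allele from a converted C, which is one reason C/T homeologous SNPs are less informative in bisulfite data.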
BitSeq is a bioinformatic software tool with two main stages: transcript expression estimation and differential expression assessment. For transcript expression estimation, sequencing reads are taken as input and aligned to the transcriptome using Bowtie; the probability of each read originating from a given transcript is then calculated, and transcript expression levels are finally estimated. For differential expression assessment, expression estimates are generated from replicates of two or more conditions; BitSeq infers the condition mean transcript expression and ranks transcripts by the likelihood of differential expression (Glaus et al., 2012).