
Chapter 2: Materials and Methods

2.2 Methods

2.2.9 Bioinformatic analysis

2.2.9.1 Using ECR Browser to view evolutionary conservation

The Evolutionary Conserved Region Browser (ECR Browser; https://ecrbrowser.dcode.org/) was used to identify regions of high conservation around the MIR137 and EU358092 locus, using an alignment of seven vertebrate species against the human genome, including puffer fish (Fugu rubripes), frog (Xenopus tropicalis), chicken (Gallus gallus), opossum (Monodelphis domestica), mouse (Mus musculus), dog (Canis familiaris), and rhesus macaque (Macaca


or puffer fish were considered to be highly conserved through evolution. This informed the selection of regions to study when testing for evolutionarily conserved modulators of gene expression around the MIR137 locus.

2.2.9.2 Using UCSC Genome Browser and Galaxy for bioinformatic analysis

The UCSC Genome Browser (https://genome.ucsc.edu) was used to carry out bioinformatic analyses across the human genome using the 2009 GRCh37/hg19 genome build. In particular, the UCSC Table Browser was accessed through the web-based platform Galaxy (https://usegalaxy.org/). This allowed the download, upload, and intersection of data sets available on the UCSC Genome Browser, as well as custom data sets. Specifically, the UCSC Genome Browser and Galaxy were used to download and overlay ENCODE ChIP-seq data for transcription factors EZH2, SUZ12, and REST, with the transcriptional start sites of all genes annotated on the UCSC ‘known gene’ track. This allowed identification of gene sets with EZH2, SUZ12, or REST binding within 500 bp of the transcriptional start site. Similarly, these tools were used to identify and download co-ordinates of retrotransposable element subfamilies SVA, L1HS, L1PA2, and L1PA3 from the human ‘repeat masker’ data set, and to identify their proximity to genes on a genome-wide scale by overlaying their co-ordinates with the co-ordinates of all known transcripts in the ‘known gene’ data set, adding 5 kb upstream to capture retrotransposable elements in the promoter region of genes.
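The overlay of retrotransposon co-ordinates with transcripts extended 5 kb upstream can be illustrated with a minimal Python sketch. This is not the Galaxy or Table Browser workflow itself; it assumes BED-style (0-based, half-open) co-ordinates, assumes that "upstream" is interpreted relative to the strand of each transcript, and uses hypothetical gene and element positions throughout.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # 0-based, exclusive
    strand: str
    name: str


def extend_upstream(tx: Interval, flank: int = 5000) -> Interval:
    """Extend a transcript by `flank` bp on its 5' side (assumed strand-aware)."""
    if tx.strand == "-":
        # For minus-strand transcripts, upstream lies at higher co-ordinates.
        return tx._replace(end=tx.end + flank)
    # For plus-strand transcripts, upstream lies at lower co-ordinates.
    return tx._replace(start=max(0, tx.start - flank))


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_near_repeats(transcripts, repeats, flank=5000):
    """Names of transcripts whose body or upstream flank overlaps any repeat."""
    hits = set()
    for tx in transcripts:
        region = extend_upstream(tx, flank)
        if any(overlaps(region, rep) for rep in repeats):
            hits.add(tx.name)
    return sorted(hits)


# Hypothetical toy data, for illustration only.
transcripts = [
    Interval("chr1", 10_000, 20_000, "+", "GENE_A"),
    Interval("chr1", 50_000, 60_000, "-", "GENE_B"),
    Interval("chr2", 10_000, 20_000, "+", "GENE_C"),
]
repeats = [
    Interval("chr1", 6_000, 7_000, "+", "SVA"),     # 3 kb upstream of GENE_A
    Interval("chr1", 62_000, 63_000, "+", "L1HS"),  # 2 kb upstream of GENE_B
    Interval("chr2", 1_000, 2_000, "+", "L1PA2"),   # 8 kb upstream of GENE_C
]

print(genes_near_repeats(transcripts, repeats))  # ['GENE_A', 'GENE_B']
```

The nested loop is quadratic and only suitable for small inputs; genome-wide data sets are better handled by an indexed interval tool such as the intersect functions in Galaxy.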

Data sets could be specified and downloaded through Galaxy or the UCSC Table Browser and saved as browser extensible data (.BED) files. All .BED files used in this thesis are available on the accompanying disk in the Supplementary Files folder. These data sets could then be loaded into UCSC separately and intersected using the UCSC Table Browser. For example, after downloading and saving the co-ordinates of all EZH2 binding signals across the genome from the ENCODE ChIP-seq data set, and all the transcriptional start sites in the genome from the ‘known gene’ data set, 500 bp was added and subtracted from each transcriptional start site co-ordinate in Excel to give a 1 kb minimal promoter region. These co-ordinates were then saved as a new .BED file and uploaded through the UCSC Table Browser. To identify all transcriptional start sites with EZH2 binding within 500 bp, the Table Browser tool was instructed to intersect data from the two files and return a list of transcriptional start site co-ordinates and corresponding gene names only for those regions which overlapped the co-ordinates of EZH2 binding in the second file.
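The promoter-window and intersection steps described above were carried out in Excel and the UCSC Table Browser; the following Python sketch only mirrors the logic for illustration. The gene names, transcriptional start sites, and peak co-ordinates are hypothetical, and the four-column tab-separated layout is an assumed minimal BED format.

```python
def promoter_window(tss: int, half_width: int = 500) -> tuple[int, int]:
    """Return a (start, end) window of up to 2 * half_width bp centred on the TSS."""
    return max(0, tss - half_width), tss + half_width


def parse_bed(text: str):
    """Parse minimal BED lines (chrom, start, end, optional name) into tuples."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        records.append((fields[0], int(fields[1]), int(fields[2]), name))
    return records


def intersect(promoters, peaks):
    """Keep promoter records that overlap at least one peak on the same chromosome."""
    kept = []
    for chrom, start, end, name in promoters:
        if any(c == chrom and s < end and start < e for c, s, e, _ in peaks):
            kept.append((chrom, start, end, name))
    return kept


# Hypothetical transcriptional start sites: (chromosome, TSS, gene name).
tss_table = [
    ("chr1", 1_000_000, "GENE_X"),
    ("chr1", 5_000, "GENE_Y"),
    ("chr2", 300, "GENE_Z"),
]

# Build the 1 kb promoter regions and serialise them as BED text.
promoters = [(c, *promoter_window(t), n) for c, t, n in tss_table]
promoter_bed = "\n".join(f"{c}\t{s}\t{e}\t{n}" for c, s, e, n in promoters)

# Hypothetical ChIP-seq peak co-ordinates in the same format.
peaks_bed = "chr1\t1000400\t1000900\tpeak1\nchr2\t5000\t5400\tpeak2"

hits = intersect(parse_bed(promoter_bed), parse_bed(peaks_bed))
print([name for *_, name in hits])  # ['GENE_X']
```

Clamping the window start at zero avoids negative co-ordinates for start sites close to the beginning of a chromosome, which would otherwise be rejected on upload.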

Further, the UCSC Table Browser can be used to upload data from external sources for viewing alongside other data sets available through the genome browser. For example, schizophrenia GWAS data was downloaded as a .BED file from the PGC website and uploaded to the UCSC Genome Browser to overlay with conservation data.

2.2.9.3 Using Ricopili to visualise schizophrenia GWAS data at the MIR137 locus

Ricopili (https://data.broadinstitute.org/mpg/ricopili/), a web-based GWAS visualisation tool hosted by the Broad Institute, was used to view the distribution of schizophrenia GWAS SNPs from the Psychiatric Genomics Consortium’s 2013 schizophrenia GWAS data set (‘PGC_SCZ52_may13’) around the MIR137 locus.

2.2.9.4 Using HapMap Genome Browser and HaploView for linkage disequilibrium analysis

SNP genotype data for the CEU/CEPH European cohort across the MIR137 locus (chr1:98,498,912–98,595,043 and chr1:98,105,779–98,855,147) was downloaded from the now retired HapMap Genome Browser, release #28 (August 2010), and


(https://www.broadinstitute.org/haploview/haploview). Haploview is freely available software supported by the Broad Institute which enables analysis of linkage disequilibrium (LD) and the definition of haplotype blocks (Barrett et al. 2005). LD analysis was performed using default parameters and outputting the D prime (D’) statistic. The D’ statistic is derived from D, the coefficient of linkage disequilibrium, which measures the difference between the observed frequency with which alleles at two loci occur together on a single chromosome, and the frequency expected if the alleles were segregating independently. Because the range of values D can take depends on the frequencies of the alleles in question, it is a problematic measurement when comparing different pairs of loci. The D’ statistic is therefore a normalised version of D, generated by dividing D by its theoretical maximum given the allele frequencies at the two loci. Alleles are said to be in complete LD when D’ is equal to one, or high LD when D’ is equal to or higher than 0.8. Haplotype blocks were determined using default parameters, as defined by Gabriel et al. (2002).
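As a worked illustration of the normalisation described above, the sketch below computes D and the absolute value of D’ using the standard Lewontin definition. It is not Haploview's code: it assumes the haplotype and allele frequencies are already known (phasing of genotype data is not shown), and the function name and example frequencies are hypothetical.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Absolute D' for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    # D: observed haplotype frequency minus that expected under independence.
    d = p_ab - p_a * p_b
    # The theoretical maximum of |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when either locus is monomorphic")
    return abs(d) / d_max


# Alleles always inherited together: complete LD.
print(d_prime(p_ab=0.5, p_a=0.5, p_b=0.5))   # 1.0
# Haplotype frequency equals the product of allele frequencies: no LD.
print(d_prime(p_ab=0.25, p_a=0.5, p_b=0.5))  # 0.0
```

Because D is divided by its own maximum for the given allele frequencies, two pairs of loci with very different allele frequencies can both reach a D’ of one, which is what makes the statistic comparable across the region.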

2.2.9.5 Using HaploReg v4.1 to access chromatin state and histone modification data in a range of human tissues and cell lines

SNPs of interest were input into HaploReg v4.1 (http://archive.broadinstitute.org/mammals/haploreg/haploreg.php), a web-based tool for exploring data on chromatin states and histone marks around the input SNPs based on ChIP-seq data in 130 human tissues and cell lines (Ward and Kellis 2012, Ward and Kellis 2016). This allowed us to determine whether the regions including the selected SNPs had potential transcriptional regulatory properties, and if so, which tissues these regulatory regions may be active in.


2.2.9.6 Using Enrichr to perform enrichment analysis on gene lists

Gene lists generated through the overlaying of transcription factor binding or retrotransposon co-ordinates with the co-ordinates of annotated genes (Section 2.2.9.2) were analysed using the Ma’ayan Lab Enrichr tool (http://amp.pharm.mssm.edu/Enrichr/) (Chen et al. 2013, Kuleshov et al. 2016). Enrichr performs enrichment analysis of an input gene list against previously annotated gene lists known to be involved in specific pathways or functions, drawing data from numerous sources including those providing data on regulation, pathways, ontologies, tissue distribution, and disease states. Enrichr uses a pre-computed lookup table with expected values for each enrichment term based on a large number of random test gene lists. The deviation of the input list from this expected value, expressed in standard deviations, is used to determine the significance of the input gene list for each particular enrichment term. We predominantly made use of enrichment data for our gene lists based on information from the Gene Ontology Consortium (GO), and from the Mouse Genome Informatics (MGI) mouse phenotype data set, both of which are available through Enrichr.
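The idea of scoring an input list by its deviation from an expectation derived from random gene lists can be sketched as a simple z-score. This is a conceptual illustration only and not Enrichr's implementation: the function name is hypothetical, the example values are invented, and the choice of which quantity is compared (for example a rank or an overlap count) is left open as an assumption.

```python
from statistics import mean, stdev


def z_score(observed: float, random_values: list[float]) -> float:
    """Number of standard deviations by which `observed` departs from the
    mean of values obtained for random gene lists (hypothetical helper)."""
    mu = mean(random_values)
    sigma = stdev(random_values)  # sample standard deviation
    if sigma == 0:
        raise ValueError("random values show no variation")
    return (observed - mu) / sigma


# Invented values for one enrichment term across three random gene lists,
# compared with the value obtained for the input gene list.
print(z_score(20, [40, 50, 60]))  # -3.0
```

In practice the expectation would be estimated from many more random lists than the three shown here, so that the mean and standard deviation are stable.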

2.2.9.7 R programming language for analysis of retrotransposon distribution data

R is a freely available software package, language, and environment that is designed for handling large data sets and performing statistical computing. In this instance, we used R to handle large data sets, with a short script written by Dr Bethany Geary to count the number of retrotransposons per Mb across the human genome based on input data downloaded from the UCSC Genome Browser’s ‘repeat masker’ data set (Section 2.2.9.2). Further, we used a publicly available script provided by Dr Giovanni M Dall’Olio through the BioStars website to count the number of transcripts per megabase across the genome (https://www.biostars.org/p/169171/#169211). Briefly, this accessed the human genome build 19 ‘known gene’ data set from UCSC through the ‘human.genes’ object using the ‘Homo.sapiens’ Bioconductor package, and counted the number of transcripts present in a specific genomic range (in this case, windows of 1 Mb).
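The per-megabase counting can be illustrated with the short Python sketch below. It is not the R script used in this thesis: it assumes BED-style input tuples, assigns each element to the 1 Mb window containing its start co-ordinate (an assumption, since elements spanning a window boundary could be handled differently), and uses hypothetical co-ordinates.

```python
from collections import Counter

WINDOW = 1_000_000  # 1 Mb


def count_per_window(features, window: int = WINDOW) -> Counter:
    """Count features per (chromosome, window index).

    Each feature is a (chrom, start, end) tuple and is assigned to the
    window that contains its start co-ordinate.
    """
    counts = Counter()
    for chrom, start, _end in features:
        counts[(chrom, start // window)] += 1
    return counts


# Hypothetical retrotransposon co-ordinates.
features = [
    ("chr1", 100, 400),
    ("chr1", 999_999, 1_000_300),    # starts in window 0, ends in window 1
    ("chr1", 1_500_000, 1_506_000),
    ("chr2", 2_000_000, 2_000_300),
]

for (chrom, index), n in sorted(count_per_window(features).items()):
    print(f"{chrom}\t{index * WINDOW}\t{(index + 1) * WINDOW}\t{n}")
```

Windows containing no elements are absent from the counter, so they need to be added with a count of zero if a complete per-window table is required.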
