Building Biomedical Infrastructure for Aggregating Variant Annotations and Prioritizing Disease Variants


_______________

A Thesis Presented to the

Faculty of

San Diego State University

_______________

In Partial Fulfillment

of the Requirements for the Degree Master of Science

in

Bioinformatics & Medical Informatics

_______________

by

Adam Maika’i Mark

Summer 2015


ABSTRACT OF THE THESIS

Building Biomedical Infrastructure for Aggregating Variant Annotations and Prioritizing Disease Variants

by

Adam Maika’i Mark

Master of Science in Bioinformatics & Medical Informatics

San Diego State University, 2015

The recent advent of massively parallel sequencing technologies has allowed academic and clinical research to shift focus to genome-wide discovery of variations for investigation of their role in disease. Attempts to comprehensively annotate genetic variants for population frequency, pathogenicity, and clinical relevance are distributed across numerous efforts. Currently, annotating genomes and variants involves downloading and parsing flat-files and uploading the data into a database. This becomes especially tedious when studying complex and heterogeneous diseases with respect to variant type or when annotations need to be updated. In order to alleviate fragmentation and allow systematic interpretation of genomic variation data, we have developed MyVariant.info.

MyVariant.info is an aggregation of human genetic variant databases into a unified, queryable web service. Development required aggregating and loading data, developing a programmatic interface for interactively querying and exploring annotations, and integrating the service into a variant discovery pipeline. I will also demonstrate a usage case for


LIST OF TABLES

Table 1. SQL to MongoDB Terms and Concepts Mapping Chart

Table 2. HGVS Nomenclature

Table 3. Number of Variant Documents in MyVariant.info

Table 4. Number of Genes Shared by at Least x=4,...x=10 Patients

Table 5. Mean CADD Scores of Candidate Disease Genes


LIST OF FIGURES

Figure 1. Schematic of my contribution to MyVariant.info. Box A- Dataloading: parsing flatfiles, mapping to JSON structure, and uploading into MongoDB. Box B- developing clients in R (and Python) for making queries to the MyVariant.info API. Box C- Implementing the MyVariant.info application to annotate variants from a disease study and reveal candidate genes.

Figure 2. JSON variant document. Annotations for variant chr1:g.160145907G>T. Some fields are collapsed for viewing ease. The merged example shows annotations available from multiple sources.

Figure 3. mygene download statistics.

Figure 4. mygene weekly usage statistics.

Figure 5. getVariant example.

Figure 6. queryVariant example.


ACKNOWLEDGEMENTS

This work was funded by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM083924.

The Next Generation Mendelian Genetics project was provided by NIH grant 1RC2 HG005608-01 to Drs. Debbie Nickerson, Jay Shendure, Michael Bamshad, and Wendy Raskind, and research on Kabuki Syndrome by 5RO1-HD48895 to Michael Bamshad. The dataset(s) used for the analyses described in this manuscript were obtained from the database of Genotype and Phenotype found at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000295.v1.p1.

I’d like to thank my committee members Faramarz Valafar and Elizabeth Waters. Each member of the lab of Andrew Su was helpful in training me, especially Tobias Meissner, Erick Scott, and Louis Gioia. Ryan Thompson of the Salomon lab contributed to the development of mygene and was a great mentor in R. A special thank you to Chunlei Wu for the guidance in the development of MyVariant.info and in good programming practices.

Thank you to Andrew Su for taking special time to train me in bioinformatics and tool development for applications in precision medicine. It was an absolute privilege to be trained in this lab.


CHAPTER 1 INTRODUCTION

Genetic variants are DNA sequence variations occurring in the common population. These may occur due to environmental factors such as radiation or chemicals, errors in DNA replication, or from nucleotide insertion and deletion events caused by endogenous mobile elements or pathogenic invasion. Variants can be classified into several categories including single-nucleotide polymorphisms (SNPs), deletions and insertions of nucleotides, inversions, and copy-number variations (CNVs).

The recent advent of next generation sequencing technologies has allowed academic and clinical research to shift focus to genome-wide discovery of variations for investigation of their role in disease. The explosion in technologies has inundated scientists with sequence variation data yet has provided little streamlined access to systematic interpretation.

Genetic variant annotations are descriptions of the mutation. Common fields include genome location, reference and alternate alleles, population allele frequency, phenotype, functional consequences, and often references to experiments and PubMed articles.

Annotations are obtained from empirical studies and submitted formally by individual labs. They may also be extracted from PubMed articles individually or as part of large genotype-phenotype studies.

Annotations can be utilized for several purposes. Functional analysis helps to predict how damaging or benign a variant may be. Nonsynonymous changes in coding may result in gain of a stop codon or amino acid change that affects structural stability of a protein. Null and synonymous variants are equally important for control experiments and often used for prioritization and filtering purposes in Mendelian disease studies.


ANNOTATION DATABASES

Attempts to comprehensively annotate genetic variants are distributed across numerous efforts. There are three general types of annotation databases: population studies which serve to provide a distribution of variant frequencies in various populations, functional prediction databases which score variants based on level of deleteriousness, and clinical databases which document variants detected from empirical studies.

Population Databases

Population databases are important for recording variants that commonly appear in diseased as well as healthy individuals. The NHLBI GO Exome Sequencing Project1 has released a database called the Exome Variant Server (EVS) that houses phenotypic data from 6503 samples, representing over 200,000 individuals from richly phenotyped populations. The goal of this project is to discover novel genes and mechanisms contributing to heart, lung, and blood disorders.

Variants from healthy individuals are equally important because they help to document null effects. Scripps Genomic Medicine researchers sequenced the DNA of America’s “well elderly” for the Wellderly study1. The idea is that the DNA of those who are 85 years and older with no history of chronic disease holds insight into longevity.

The 1000 Genomes Project2 is a wide-scale, collaborative effort to sequence thousands of individuals from several populations to characterize a distribution of common variants. Low coverage whole genome sequencing of 179 individuals from four populations and high coverage whole exome sequencing of 697 individuals from seven populations provide a list of approximately 15 million SNPs, 1 million indels, and 20,000 structural variants that may be used for control in disease studies.

1 Scripps Wellderly Genome Resource, The Scripps Wellderly Study, La Jolla, CA (URL: stsi-ftp.sdsc.edu), April 2015.


Functional Impact

There are two types of databases that house functional prediction scores for how deleterious a mutation is: machine learning and conservation-based methods. Machine learning algorithms include Polyphen-23, FATHMM4, MutationTaster5, and CADD6. Polyphen-2 uses sequence and structure-based predictive features and calculates the naïve Bayes posterior probability that a given mutation is damaging. Polyphen-2 offers two types of scores, each trained on separate datasets: HumDiv and HumVar. HumDiv data comes from all (over 3000) damaging alleles from the UniProt database annotated to be causal of Mendelian diseases and HumVar data comes from all (over 13,000) disease causing mutations from UniProt and roughly 9,000 nonsynonymous SNPs without previous annotation for disease. Furthermore, the estimates of false positives and true positives assist in the qualitative evaluation of a mutation as benign, possibly damaging, or probably damaging.

Combined Annotation Dependent Depletion (CADD)6 is a tool that integrates 63 annotation types to score deleteriousness of SNPs and insertions or deletions in the human genome. CADD will be described in further detail in Chapter 3. Variant Effect Predictor7 (VEP), an annotation tool, was used to generate data for training a support vector machine on 29.4 million alleles, half fixed in the human population and half simulated de novo variants. The resulting scaled scores ranked stop-lost, stop-gained, canonical splice, and nonsynonymous variants as the most deleterious events.

In addition to algorithms based on machine learning, another class of methods relies on conservation scores. These methods include SIFT8, MutationAssessor9, SiPhy10, GERP11, and LRT12. Of particular interest is the most popular conservation method, SIFT8. SIFT, which stands for Sorting Intolerant from Tolerant, considers the probability of substitution for all amino acids at the given position. Scores are compiled from homologous sequence alignments using the PSI-BLAST algorithm. Highly conserved positions are typically less tolerant to variation whereas less conserved positions are more tolerant.

Similarly, GERP utilizes a comparative genomics approach. This tool identifies sites under evolutionary constraint by aligning sequences from multiple divergent species to estimate the probability of substitution at each site.


dbNSFP13, the database of nonsynonymous functional prediction, is an integrated database containing functional prediction scores from several machine learning and conservation-based algorithms for all possible nonsynonymous SNPs in the human genome. dbNSFP also provides population allele frequencies of variants mapped from the 1000 Genomes Project and Exome Aggregation Consortium.

Clinical and Phenotype

Popular databases for clinical annotation of variants include ClinVar14, COSMIC15, and dbSNP. ClinVar is a source of curated variant annotations that relate mutation to clinical health. COSMIC, the Catalog Of Somatic Mutations In Cancer, and dbSNP16, the database of Single Nucleotide Polymorphisms, are popular databases for annotating clinical relevance and phenotype. dbSNP is an NCBI hosted database for annotations of SNPs, small insertion and deletion events, as well as structural variants. dbSNP accepts data submitted from association studies, disease studies, and even null variants with no implications for disease.

Databases may fall under both categories of clinical and population-based such as the Exome Aggregation Consortium2 (ExAC). ExAC is a dataset consisting of variants from over 60,000 exomes sequenced as part of various disease-specific and population genetic studies from around the world that includes data from 1000 Genomes Project, ESP6500, and several other disease studies.

Annotation Tools

Currently, annotating genomes and variants involves downloading and parsing flat-files and uploading the data into a database. Scientists struggle with tedious aggregation of databases for their own use. This task threatens the theme of reproducibility when annotations need to be updated. With the steady influx of variant annotations, it is common for a scientist to miss annotations.

Popular existing software applications include ANNOVAR17, VEP, UCSC’s Variant Annotation Integrator18,19, and SnpEff20, which help to deliver annotations of called variants from collections of databases. ANNOVAR is able to provide select annotations from locally stored databases if they conform to a specific format. Variant Effect Predictor is a tool that predicts functional consequences from SIFT and Polyphen-2 scores and can provide allele frequencies and clinical annotations available from 1000 Genomes Project and ESP6500. Similarly, UCSC’s tool provides annotations available from dbNSFP, dbSNP, and COSMIC. SnpEff is another widely used tool, capable of annotating variants with information from dbNSFP, dbSNP, ENCODE21 project, 1000 Genomes Project, as well as other databases.

Aggregation of variant annotations aims to lessen a researcher’s challenge to prioritize candidate genes in disease studies. However, there are no tools available to provide the sufficient level of information to comprehensively annotate an experiment to discover novel variants. Secondly, these annotation tools are command line tools where input is a file and output is another file. None of these tools provide a programmatic interface to


CHAPTER 2 MYVARIANT.INFO

We have developed MyVariant.info, an aggregation of genetic variant databases into a unified, queryable web service with a REST API to alleviate the segregation of annotation data and allow systematic interpretation of genomic variants. While variant annotation from aggregated sources is not a novel concept, the innovation is derived from collaborative community contribution and its purpose as a queryable portal to knowledge discovery and data mining.

MyVariant.info builds on the success of MyGene.info22, an effective annotation resource that provides programmatic access to gene-centric annotation data. Specifically, MyGene.info is a set of web services that aggregates data dispersed amongst annotation databases including UCSC, NetAffx, UniProt, PharmGKB, Ensembl, CPDB, Entrez and more. More than 50 fields of information are provided for 17,000,000 genes and almost 14,000 species. It is updated weekly to ensure the most up-to-date annotation data. MyGene.info receives roughly 3,000,000 requests per month and has accumulated over 70 million requests. 60% of traffic comes from over 3000 unique external IPs.

The goals for MyVariant.info align with MyGene.info’s capabilities. Most of the core infrastructure of MyGene.info’s backend was carried over, adapted, and implemented into MyVariant.info’s backend. The proven capabilities of MyGene.info’s infrastructure make it an ideal candidate to support the service that MyVariant.info would provide.

My contribution to this project (Figure 1) is encompassed by three tasks: loading data, developing clients to annotate variants, and integrating the service into a discovery pipeline for filtering and prioritizing variants.


Figure 1. Schematic of my contribution to MyVariant.info. Box A- Dataloading: parsing flatfiles, mapping to JSON structure, and uploading into MongoDB. Box B- developing clients in R (and Python) for making queries to the MyVariant.info API. Box C- Implementing the MyVariant.info application to annotate variants from a disease study and reveal candidate genes.

DATA LOADING

Stocking the backend of MyVariant.info with annotation data requires parsing individual databases, structuring the data as JavaScript Object Notation (JSON), and loading into MongoDB with a single python script.

JSON is ideal for variant document storage due to its inherent key-value structure. JSON storage is also well suited because it can store heterogeneous data and is agnostic to the number of stored fields. A given variant document may contain from one field to several hundred fields of annotations. JSON is also an interoperable data structure that can be easily converted to data types in common programming languages. As in MyGene.info, commonly used databases are downloaded as flat-files and parsed into JSON format using Python scripts. Each JSON document is loaded into a NoSQL database called MongoDB. MongoDB is an open source document based database that provides high performance, availability, and scalability.
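The parsing step described above can be sketched as follows. This is a minimal illustration, not the project's actual parser: the column names (chrom, pos, ref, alt, gene, score), the source name exampledb, and the function parse_flat_file are all hypothetical, and only single nucleotide substitutions are handled. In a real loader the yielded documents would be written to a MongoDB collection instead of being printed.

```python
import csv
import io
import json

# Columns used to build the identifier (hypothetical flat-file layout).
KEY_COLUMNS = ("chrom", "pos", "ref", "alt")


def parse_flat_file(handle, source_name):
    """Yield one JSON-ready document per row of a tab-delimited flat file.

    Annotations are nested under the source name so that documents from
    different databases can later be merged on the shared "_id".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Substitution-only identifier; other variant types need more logic.
        hgvs_id = "chr{chrom}:g.{pos}{ref}>{alt}".format(**row)
        annotations = {
            key: value
            for key, value in row.items()
            if key not in KEY_COLUMNS and value != ""  # drop empty fields
        }
        yield {"_id": hgvs_id, source_name: annotations}


# Toy input standing in for a downloaded flat file.
example = io.StringIO(
    "chrom\tpos\tref\talt\tgene\tscore\n"
    "1\t241\tT\tC\tGENE1\t0.93\n"
    "1\t300\tG\tA\tGENE2\t\n"
)
docs = list(parse_flat_file(example, "exampledb"))
print(json.dumps(docs[0], indent=2))
```

Because empty fields are dropped, the second toy document carries only a gene field, which illustrates how documents can differ in the number of fields they store.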

The native storage structure of MongoDB documents is similar to JSON objects, where fields are keys and values can be other documents or arrays. SQL and MongoDB terminology and concepts are very similar (Table 1). The primary differences in terminology are table versus collection, row versus document, column versus field, and table joins versus embedded documents and linking. In SQL, a primary key, which is a unique and immutable identifier, is a column or combination of columns, whereas MongoDB uses the ‘_id’ field as the primary key by default.

MongoDB serves as the intermediate database where the documents will be stored, updated, and indexed into ElasticSearch. ElasticSearch is a scalable, open-source, RESTful query engine that provides an API and allows for programmatic access to MyVariant.info annotation data. It provides rich query syntax for retrieving any annotation type. Behind the scenes, the API is further structured on Tornado, a Python-based web framework that allows high concurrency among users.

Table 1. SQL to MongoDB Terms and Concepts Mapping Chart

SQL                    MongoDB
database               database
table                  collection
row                    document
column                 field
index                  index
table joins            embedded documents and linking
primary key            primary key
specify primary key    default _id field to primary key

JSON variant documents are indexed in a Python dictionary by Human Genome Variation Society23,24 (HGVS) IDs. HGVS has defined a consistent nomenclature that articulates a mutation’s position in the genome as well as the change. The identifier is as unique as the mutation it serves to describe and allows a researcher to understand the basis of the mutation from the identifier alone. Table 2 provides examples of common variant types and nomenclature. The primary key allows annotations to be integrated on variant HGVS IDs, both for storing the variant document and for allowing the user to make queries. An HGVS ID can be built by extracting information from a VCF (Variant Call Format) file, which contains the fields needed to concatenate strings into a proper HGVS ID.
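As a rough illustration of how VCF fields can be concatenated into a genomic HGVS ID, the sketch below handles substitutions, deletions, insertions, and combined deletion-insertions. The function name hgvs_from_vcf is hypothetical, and the handling of the VCF anchor base and the use of "delins" for complex changes are assumptions based on the general VCF and HGVS conventions, not a transcription of the MyVariant.info loader.

```python
def hgvs_from_vcf(chrom, pos, ref, alt):
    """Build a genomic HGVS ID from the CHROM, POS, REF and ALT VCF fields."""
    chrom = str(chrom)
    if not chrom.lower().startswith("chr"):
        chrom = "chr" + chrom
    pos = int(pos)

    # Single nucleotide substitution, e.g. chr1:g.241T>C
    if len(ref) == 1 and len(alt) == 1:
        return f"{chrom}:g.{pos}{ref}>{alt}"

    # Deletion: VCF keeps the base before the event as an anchor,
    # so the deleted bases start one position after POS.
    if len(ref) > 1 and len(alt) == 1 and ref[0] == alt:
        start = pos + 1
        end = pos + len(ref) - 1
        if start == end:
            return f"{chrom}:g.{start}del"
        return f"{chrom}:g.{start}_{end}del"

    # Insertion: inserted bases fall between POS and POS + 1.
    if len(ref) == 1 and len(alt) > 1 and alt[0] == ref:
        return f"{chrom}:g.{pos}_{pos + 1}ins{alt[1:]}"

    # Anything else is written as a deletion-insertion (assumed convention).
    end = pos + len(ref) - 1
    if end == pos:
        return f"{chrom}:g.{pos}delins{alt}"
    return f"{chrom}:g.{pos}_{end}delins{alt}"


print(hgvs_from_vcf("1", 241, "T", "C"))
print(hgvs_from_vcf("1", 412, "GA", "G"))
print(hgvs_from_vcf("1", 451, "C", "CT"))
```

The three calls reproduce the substitution, single nucleotide deletion, and single nucleotide insertion forms listed in Table 2.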

I created parsers for the databases listed in Table 3. Each database parser is a unique module in the code repository. The input for each parser is typically a VCF file or flat-file where one variant is a row. The output is a MongoDB collection of JSON variant documents. Documents are then merged on the primary key where database names are keys and available attributes from respective databases are values.
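The merge on the primary key can be pictured with a small in-memory sketch. The function merge_documents and the annotation values are hypothetical placeholders; the actual merge operates on MongoDB collections, which are represented here by plain Python lists of dictionaries.

```python
def merge_documents(collections):
    """Merge per-database documents that share the same "_id".

    Each input document looks like {"_id": <HGVS ID>, <database name>: {...}}.
    In the merged document, database names are keys and the attributes
    available from each database are the values.
    """
    merged = {}
    for docs in collections:
        for doc in docs:
            target = merged.setdefault(doc["_id"], {"_id": doc["_id"]})
            for key, value in doc.items():
                if key != "_id":
                    target[key] = value
    return merged


# Placeholder documents standing in for two parser outputs.
source_a = [
    {"_id": "chr1:g.241T>C", "source_a": {"field_x": "value 1"}},
    {"_id": "chr1:g.413del", "source_a": {"field_x": "value 2"}},
]
source_b = [
    {"_id": "chr1:g.241T>C", "source_b": {"field_y": "value 3"}},
]

merged = merge_documents([source_a, source_b])
print(merged["chr1:g.241T>C"])
```

A variant annotated by both sources ends up as one document with two top-level database keys, while a variant seen by only one source keeps a single key.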

MyVariant.info also allows community contribution by permitting members to import variant annotation resources by Python scripts, forking the MyVariant.info GitHub repository and sending a pull request. Each parser follows a guideline, conforming to a loading framework with recurring utility functions to allow streamlined loading of variant documents. This outline will be further developed and made publicly available.

Table 2. HGVS Nomenclature

Variant Type    HGVS Nomenclature              Notes
Substitution    chr1:g.241T>C                  single nucleotide substitution
Deletion        chr1:g.413del                  single nucleotide deletion
                chr1:g.290_297del              >1 nucleotide deletion
Duplication     chr1:g.413dup                  single nucleotide duplication
                chr1:g.692_694dup              several nucleotide duplication
Insertion       chr1:g.451_452insT             single nucleotide insertion
                chr1:g.451_452insGAGA          several nucleotide insertion
                chr1:g.777_778insAB012345.1    large insertion with a submitted sequence
Inversion       chr1:g.1077_1080inv            short inversion


Table 3. Number of Variant Documents in MyVariant.info

Database     Documents Loaded
COSMIC       1,024,759
CADD         163,690,986
dbNSFP       84,586,164
dbSnp        110,234,210
Docm         1,119
EMVClass     12,066
EVS          1,977,300
GWASSnps     15,030
MutDb        420,246
Snpedia      5,905
Wellderly    21,240,519
Total        383,208,304
Unique       286,204,690

Figure 2 shows an example of a merged variant document in the browser where annotations are available from multiple databases.

Table 3 displays the number of variant documents that are currently indexed to ElasticSearch, merged by genomic HGVS ID, and queryable. MyVariant.info has been officially deployed as a web service at http://myvariant.info. The backend parsers and architecture code are located at https://github.com/Network-of-BioThings/myvariant.info.

PROGRAMMATIC ACCESS

There are two MyVariant.info endpoints at which annotations can be retrieved: http://myvariant.info/v1/variant/<variant_id> and http://myvariant.info/v1/query/?q=<query>. The /variant/ endpoint accepts HGVS IDs. In its simplest form, the URL http://myvariant.info/v1/variant/chr1:g.160145907G>T requests the data from Figure 2 from the server. The /query/ endpoint accepts miscellaneous terms that match available fields. The URL http://myvariant.info/v1/query?q=chr1:69000-70000 requests annotations from every variant that lies on chromosome 1, from positions 69,000 to 70,000.
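For readers who want to call the two endpoints without a dedicated client, the sketch below builds the request URLs with the Python standard library. The helper names variant_url, query_url, and fetch_json are hypothetical, and percent-encoding of characters such as ">" and ":" is a general HTTP precaution rather than a documented requirement of the service. The fetch_json helper assumes the service is reachable and returns JSON, so it is defined but not called here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://myvariant.info/v1"


def variant_url(hgvs_id):
    """URL for the /variant/ endpoint, which takes a single HGVS ID."""
    # Keep ":" readable; characters such as ">" are percent-encoded.
    return f"{BASE_URL}/variant/{quote(hgvs_id, safe=':')}"


def query_url(q, **params):
    """URL for the /query endpoint, which takes a query term in q."""
    return f"{BASE_URL}/query?{urlencode({'q': q, **params})}"


def fetch_json(url):
    """Perform the GET request and decode the JSON response."""
    with urlopen(url) as response:
        return json.load(response)


print(variant_url("chr1:g.160145907G>T"))
print(query_url("chr1:69000-70000"))
```

Separating URL construction from the network call keeps the request logic easy to inspect and reuse from either a script or an interactive session.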


Figure 2. JSON variant document. Annotations for variant chr1:g.160145907G>T. Some fields are collapsed for viewing ease. The merged example shows annotations available from multiple sources.


In a typical genomic analysis of variants, a user may be returned a list of millions of variants to be annotated. For these common cases, clients have been developed to programmatically access MyVariant.info’s services and save the structured annotations.

Several programmatic libraries exist in many programming languages to make HTTP requests. While these are useful for a user to access MyVariant.info web services, its use would be encouraged by readily available clients in common programming languages.

mygene.R

Groundwork for MyVariant.info third party applications has already been structured. I developed an R/Bioconductor25 package called mygene12 that serves as a wrapper for accessing MyGene.info services. Bioconductor is an open source repository for software packages developed in the language of R that aid in genomic analysis. The user is exposed to functions that retrieve annotations given gene IDs, genomic ranges, and many other possible query terms. Returned data structures are interoperable amongst core Bioconductor packages to enable a fluid workflow in a genomic analysis. Similarly, I contributed to the mygene python client.

Figure 3, from Bioconductor26, shows mygene download statistics. mygene has accrued over 1400 downloads from almost 900 distinct IPs since October 2014. Figure 4 shows usage statistics since October 2014. Approximately 73,000 requests per month have been made using mygene.

Notable users of mygene include The Broad Institute of MIT and Harvard, Johns Hopkins University, Heidelberg EMBL, UC Santa Cruz, UC San Diego, Oxford University, Sanger Institute UK, Johnson & Johnson, the Mayo Foundation in Rochester Minnesota, University of Wisconsin, University of Oklahoma, University of Iowa, and Sloan Kettering. The ElasticSearch analytics engine has also detected international usage from the UK, Brazil, Italy, Belgium, France, Greece, China, and Norway.

myvariant.R

Construction of the MyVariant.info clients required just minute adjustments to the mygene code to query the REST API. Functions were written for both GET and POST requests to each of the endpoints of the API, in both R and Python.


Month         Nb of distinct IPs    Nb of downloads
Oct-14        76                    103
Nov-14        179                   244
Dec-14        159                   195
Jan-15        185                   310
Feb-15        153                   276
Mar-15        236                   302
Apr-15        23                    26
All Months    872                   1456

Figure 3. mygene download statistics. Source: Bioconductor. Download stats for Software package mygene http://bioconductor.org/packages/stats/bioc/mygene.html (2015).


Figure 4. mygene weekly usage statistics.


The R package is tailored to Bioconductor criteria for high-quality, thorough documentation, and interoperability.

Two user-exposed functions were developed to retrieve variant annotations: getVariant and getVariants. The getVariant function accepts one HGVS ID. Similarly, getVariants accepts lists, vectors, or strings of HGVS IDs. The fields parameter can accept database names and fields of database names.

Figure 5. getVariant example.

Figure 5 shows how to use the getVariant function to annotate the merged example from Figure 2. By storing the annotations as a variable, the user can access specific fields. Polyphen-2 annotations for predicting deleteriousness of the variant are shown.

Two more user-exposed functions were developed to retrieve annotations given wildcard query terms that follow a strict syntax. queryVariant, which makes a GET request, accepts a parameter q.


Figure 6 displays an example of how to retrieve the Ensembl gene ID of all variants documented in the dbNSFP database with polyphen2-hdiv scores greater than 0.99 that reside in chromosome 1 and have been validated in dbSNP. The query functions retrieve from the http://myvariant.info/v1/query/<query> endpoint. Dr. Chunlei Wu developed the backend that allows this type of query through the ElasticSearch python library.

queryVariants, which makes a POST request, accepts a list of query terms of any annotation type, as long as the type is declared by the scopes parameter, and retrieves the annotations specified by the fields parameter. For example, given a list of rsIDs from dbSNP, queryVariants(qterms=ri_id_list, scopes="dbsnp.rsid", fields="dbnsfp") will retrieve annotation fields from the database dbNSFP.
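To show roughly what such a batch call involves at the HTTP level, the sketch below prepares a POST request with the Python standard library. It assumes that the endpoint accepts form-encoded q (comma-separated terms), scopes, and fields parameters; the helper build_batch_query and the rsID strings are illustrative placeholders, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_batch_query(terms, scopes, fields, base_url="http://myvariant.info/v1"):
    """Prepare a POST request for many query terms at once.

    Assumed form parameters: q (comma-separated terms), scopes (the field
    the terms should be matched against) and fields (annotations to return).
    """
    body = urlencode(
        {"q": ",".join(terms), "scopes": scopes, "fields": fields}
    ).encode("utf-8")
    return Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder rsIDs, used only to show the shape of the request.
request = build_batch_query(["rs123", "rs456"], "dbsnp.rsid", "dbnsfp")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the terms in the request body rather than in the URL is what makes POST suitable for long lists of identifiers.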

Data from POST requests can be returned as R/Bioconductor DataFrame objects. A DataFrame is essentially an interactive table in the R console. This allows for easy sub-setting of variants in order to filter by annotation types and values. The DataFrame structure was developed by Bioconductor to permit entire character lists in one cell where multiple annotations may exist for a field.

The R package is available for download at https://github.com/Network-of-BioThings/myvariant.R.

I developed a Python module with the same syntax and parameters as the R package. Syntax and parameters are precisely the same due to the ElasticSearch API architecture. Results of POST requests can be returned as a DataFrame as well, using the structure provided by the pandas module for scientific computing. The Python module can be downloaded at https://github.com/Network-of-BioThings/myvariant.py.


CHAPTER 3 VARIANT DISCOVERY

The exome is the part of the genome that is comprised exclusively of exons, the protein coding region of DNA that remains after introns are removed during RNA transcription and splicing. The exome encompasses roughly 1% of the entire genome but harbors approximately 85% of mutations that can affect disease.

Whole exome sequencing (WES) is a technique that sequences only the exons using high throughput DNA sequencing technology. This technique shows promise for identifying genetic variation and unraveling the genetic basis of human diseases27–32. Ng et al recently elucidated the genes responsible for Miller syndrome and Kabuki syndrome30. Exome sequencing also revealed mutations in SETBP1 as the cause for Schinzel-Giedion syndrome32, a disease characterized by severe mental retardation and facial malformations. However, filtering and prioritizing variants in next generation sequencing studies for their role in disease remains a hurdle. While software tools have been developed to deliver annotations from several databases and existing frameworks to filter genetic variants7,17,20, none provide an interface for querying across multiple databases at a sufficient level to comprehensively annotate and discover novel variants. Here, I demonstrate a framework that leverages MyVariant.info’s services to reveal candidate genes responsible for pathogenesis in rare Mendelian diseases.

KABUKI SYNDROME

Mendelian disorders are genetic diseases that follow Mendelian patterns of inheritance. 80% of rare diseases are genetic. According to the National Institutes of Health Office of Rare Diseases, a disease is considered “rare” if it affects fewer than 200,000 people in the United States population. With approximately 7,000 different types of described rare diseases, rare diseases collectively affect 30 million people in the United States and 350 million people worldwide.

Kabuki syndrome is a rare disease characterized by skeletal and facial malformations, cardiac anomalies, and mental retardation. Incidence is roughly one in 32,000, and about 400 cases have been reported worldwide. Established approaches to gene discovery were previously unsuccessful, potentially due to phenotypic variability. To address this issue, Ng et al30 sequenced the exomes of ten unrelated individuals with Kabuki syndrome from three different populations.

The assumption that mutations underlying Kabuki syndrome are rare means they are either unlikely to be annotated or appear at low allele frequencies in variant databases. The first filtering step required comparing variants against 1000 Genomes Project variant allele frequencies and dbSNP version 129. Under a dominant model, which means only one variant from a gene must appear in each person, the only gene that was mutated in all ten patients was MUC16. This was considered a false positive due to the extremely large size of the gene. Accordingly, genes were stratified by presence in groups of patients from 1 to 10. Patients were also stratified into groups according to phenotypic similarity and then ranked subjectively by intragroup similarity. Among groups, shared variants were characterized for functional consequences such as nonsense, nonsynonymous, frame-shift, splice-site disruption, or stop-gained.
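The stratification of genes by the number of patients carrying at least one variant can be sketched as below. This is an illustrative reading of the dominant model described above, not the code used by Ng et al; the function names, patient labels, and gene names are hypothetical toy data.

```python
from collections import defaultdict


def count_carriers(patient_genes):
    """Count, for each gene, how many patients carry at least one variant.

    Under a dominant model a single variant per gene per patient suffices,
    so repeated hits within one patient are counted once.
    """
    carriers = defaultdict(set)
    for patient, genes in patient_genes.items():
        for gene in set(genes):
            carriers[gene].add(patient)
    return {gene: len(patients) for gene, patients in carriers.items()}


def genes_shared_by_at_least(counts, minimum):
    """Genes mutated in at least `minimum` patients, sorted by name."""
    return sorted(gene for gene, n in counts.items() if n >= minimum)


# Toy cohort: each patient maps to the genes hit by their filtered variants.
cohort = {
    "patient_1": ["GENE_A", "GENE_B", "GENE_A"],
    "patient_2": ["GENE_A", "GENE_C"],
    "patient_3": ["GENE_A", "GENE_B"],
}
counts = count_carriers(cohort)
print(genes_shared_by_at_least(counts, 3))
print(genes_shared_by_at_least(counts, 2))
```

Lowering the threshold from all patients to a subset is what allows a gene to surface when some patients carry no detectable variant in it.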

The gene MLL2 was identified as a candidate gene due to deleterious consequences of the frame-shift or indel variants in 7 of the 10 patients. Further rationale for concluding MLL2 mutations as a cause for Kabuki syndrome includes genomic evolutionary rate profiling, or GERP, rejection scores. GERP is a conservation score that aligns mammalian sequences to identify likelihood of substitutions. MLL2 variants ranked high in GERP scores among shared genes.

Ng et al had previously elucidated genes responsible for Miller syndrome28, a rare Mendelian disorder clinically similar to Kabuki syndrome. Filtering against databases of common or previously described variants as well as ranking by the functional prediction score SIFT, revealed DHODH as the causal gene for Miller syndrome.


APPLICATION OF MYVARIANT.R

To demonstrate MyVariant.info’s capabilities as an annotation service, I emulated the workflow by Ng et al (Figure 7). Though the general steps I took for prioritizing variants were similar, most steps were substituted using annotations from more encompassing databases. Some of the technologies I use to annotate variants were not available at the time of the Kabuki study.

Methods

The database of Genotype and Phenotype (dbGaP)33,34 granted me permission to obtain the FASTQ files generated by Ng et al for the Kabuki syndrome study. FASTQ files were processed according to the Broad Institute’s best practices35 on the Garibaldi high performance compute cluster at The Scripps Research Institute. At least four replicates from ten patients were sequenced. Individual samples were aligned to the hg19 reference genome using BWA 0.7.10, which implements the Burrows-Wheeler algorithm36 for alignment. Picard 1.103 was used to deduplicate reads from bam files. Patient bam files were then merged using Samtools37 1.0. Variants were then called using GATK38 3.3 HaplotypeCaller and quality scores were recalibrated using GATK VariantRecalibrator. VCF files containing variant calls were then copied to my local machine to perform the annotation and analysis.

Variants called by GATK are output to a VCF (variant call format) file. VCF is a standardized format developed by the 1000 Genomes Project Analysis Group and adopted by several other projects. VCF files are text files that contain lines of meta-information about annotation fields, a header line, and one data line per variant. The header line names 8 mandatory fields: CHROM (chromosome), POS (position), ID (variant identifier), REF (reference allele), ALT (alternate allele), QUAL (phred-scaled quality score), FILTER (PASS or a semicolon-separated list of the filters the call failed), and INFO (a semicolon-separated series of keys with optional values).
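As an illustration only, the record layout described above can be sketched in Python, one of the two client languages mentioned in this thesis. This is a minimal sketch and not the readVcf implementation: the function names are hypothetical, and the QUAL and INFO values in the example record are invented for the example.

```python
# Minimal VCF record parser (illustrative sketch, standard library only).
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Split an INFO string into a dict; keys without a value become True."""
    result = {}
    if info in ("", "."):
        return result
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True
    return result


def parse_vcf_line(line):
    """Turn one tab-separated data line into a dict of the 8 mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # ALT may list several alleles
    record["INFO"] = parse_info(record["INFO"])
    return record


def read_vcf_text(text):
    """Parse VCF text, skipping meta-information (##) and header (#) lines."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        records.append(parse_vcf_line(line))
    return records


# Example record: position and alleles follow one MLL2 variant from this
# chapter; QUAL and INFO are made-up values.
example = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "12\t49420554\t.\tC\tT\t812.77\tPASS\tAC=1;AF=0.5;DB\n"
)
records = read_vcf_text(example)
print(records[0]["CHROM"], records[0]["POS"], records[0]["INFO"])
```

Keeping INFO as a dictionary makes it straightforward to look up a single annotation key without re-splitting the string for every query.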

I developed a function in the myvariant R package called readVcf to read a VCF file into the global environment. Because HGVS nomenclature is complicated, I also developed functions that build HGVS ids from the fields of the VCF file: the columns CHROM, POS, REF, and ALT are concatenated to return properly structured identifiers. readSnps, readDels, readIns, and readIndels are available to read in SNPs, deletions, insertions, and insertion-deletion variants, respectively.
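The construction of identifiers from CHROM, POS, REF, and ALT can be sketched as follows. This is a hedged Python illustration of genomic HGVS-style ids for the four variant classes named above, not the code of the R functions. The helper name format_hgvs and the handling of the leading anchor base in indel records are assumptions of this sketch.

```python
def format_hgvs(chrom, pos, ref, alt):
    """Build a genomic HGVS-style id from VCF fields (illustrative sketch)."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    pos = int(pos)
    prefix = "chr{}:g.".format(chrom)

    # SNP: single base replaced by a single base.
    if len(ref) == 1 and len(alt) == 1:
        return "{}{}{}>{}".format(prefix, pos, ref, alt)

    # Deletion: ALT is only the anchor base, so the bases after it are removed.
    if len(alt) == 1 and ref[0] == alt:
        start, end = pos + 1, pos + len(ref) - 1
        if start == end:
            return "{}{}del".format(prefix, start)
        return "{}{}_{}del".format(prefix, start, end)

    # Insertion: REF is only the anchor base, new bases follow it.
    if len(ref) == 1 and alt[0] == ref:
        return "{}{}_{}ins{}".format(prefix, pos, pos + 1, alt[1:])

    # Anything else is treated as an insertion-deletion (delins).
    end = pos + len(ref) - 1
    if end == pos:
        return "{}{}delins{}".format(prefix, pos, alt)
    return "{}{}_{}delins{}".format(prefix, pos, end, alt)


print(format_hgvs("12", 49420554, "C", "T"))
print(format_hgvs("chr1", 100, "ATG", "A"))
print(format_hgvs("1", 100, "A", "ATC"))
print(format_hgvs("1", 100, "AT", "GC"))
```

The coordinates on chromosome 1 are arbitrary examples; only the first call uses a variant reported in this chapter.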

I used the base R function lapply to read in all VCF files in a single call and to query MyVariant.info for each. A DataFrame was returned for each subject, containing the matched annotations, with NA values for fields where data is missing.

Each DataFrame contains between 32,000 and 46,000 rows, one per variant. The first step is to filter out variants observed in the general population. The R call subset(i, j) selects rows from DataFrame i, where j is a logical expression of the conditions a row must meet in order to be returned.

Results

The left side of Figure 7 shows Ng’s workflow, whereas the right side shows the workflow implemented using the MyVariant.info R client. Common variants were removed by excluding anything that appears in the Exome Aggregation Consortium (ExAC) data at an allele frequency of 0.05 or greater.

The next command retained only variants that are nonsynonymous (NS), gain or lose a stop codon (SC), or affect a splice site (SS).
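The two filters just described can be sketched in Python as a pair of predicates. This is an illustration under stated assumptions, not the R code used in the analysis: the field names exac_af and consequence, the consequence labels, and the toy variants are hypothetical, and a missing allele frequency is assumed to mean the variant was not observed in ExAC.

```python
# Consequence classes kept by the second filter (labels are hypothetical).
KEPT_CONSEQUENCES = {"nonsynonymous", "stop_gained", "stop_lost", "splice_site"}


def is_rare(variant, max_af=0.05):
    """Keep variants below the allele frequency cutoff.

    Assumption: a missing frequency (None) means the variant is absent
    from the population database, so it is kept.
    """
    af = variant.get("exac_af")
    return af is None or af < max_af


def is_functional(variant):
    """Keep nonsynonymous, stop-gain/loss, and splice-site variants."""
    return variant.get("consequence") in KEPT_CONSEQUENCES


def prioritize(variants, max_af=0.05):
    """Apply the frequency filter, then the consequence filter."""
    return [v for v in variants if is_rare(v, max_af) and is_functional(v)]


# Toy records for illustration only.
variants = [
    {"id": "v1", "exac_af": 0.31, "consequence": "nonsynonymous"},
    {"id": "v2", "exac_af": None, "consequence": "stop_gained"},
    {"id": "v3", "exac_af": 0.001, "consequence": "synonymous"},
    {"id": "v4", "exac_af": 0.0264, "consequence": "nonsynonymous"},
]
kept = [v["id"] for v in prioritize(variants)]
print(kept)  # ['v2', 'v4']
```

Because the cutoff is a parameter, the same saved records can be refiltered with a stricter or looser threshold without querying the service again.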

Figure 7. Workflow to reveal candidate genes in Kabuki syndrome.

The remaining candidate genes were then ranked according to the CADD score of the variants they harbor. The CADD functional impact algorithm was deemed appropriate for this study for its ability to distinguish disease-associated alleles from rare, benign variants [6].
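The ranking step can be illustrated with a short Python sketch that averages the scaled CADD score of the variants in each gene and sorts genes by that mean. It is one plausible reading of the ranking described above, not the code used for the analysis. The field names gene and cadd_phred and the toy records are hypothetical.

```python
from collections import defaultdict


def rank_genes_by_mean_cadd(variants):
    """Return (gene, mean scaled CADD) pairs, highest mean first.

    Variants without a score are skipped rather than counted as zero.
    """
    scores = defaultdict(list)
    for variant in variants:
        score = variant.get("cadd_phred")
        if score is not None:
            scores[variant["gene"]].append(score)
    means = {gene: sum(vals) / len(vals) for gene, vals in scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy records for illustration only; gene names are placeholders.
toy_variants = [
    {"gene": "GENE_A", "cadd_phred": 30.0},
    {"gene": "GENE_A", "cadd_phred": 20.0},
    {"gene": "GENE_B", "cadd_phred": 10.0},
    {"gene": "GENE_C", "cadd_phred": None},
]
ranking = rank_genes_by_mean_cadd(toy_variants)
print(ranking)  # [('GENE_A', 25.0), ('GENE_B', 10.0)]
```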

Table 4 displays the number of genes shared by at least x patients, for x=4 through x=10, after each of the filters mentioned above. Fewer genes are shared as the number of patients increases. The bottom rows of the table give the top-ranked gene and its mean scaled (phred-like) CADD score. MLL2 SNPs appear in 5 of the patients. A scaled CADD score of 10 means the variant is in the top 10% of deleterious variants, a score of 20 means the variant is in the top 1%, and so on. Because the scale is logarithmic, the 7.2-point difference in mean score means the MLL2 variants are on average roughly five times more deleterious than the next highest ranking variants, in HYDIN.

Table 4. Number of Genes Shared by at Least x=4,...x=10 Patients

                         Number of genes shared by at least x=4,...x=10 patients

Filter Applied              10      9      8      7      6      5      4

None                       562   1410   2296   3103   3854   4590   5368

ExAC af < 0.05               5      9     25     39     63    110    219

NS/SS/SC                     4      7     17     31     52     85    158

Top Gene                 HYDIN  HYDIN  HYDIN  HYDIN  HYDIN   MLL2   MLL2

Mean CADD of Top Gene    32.26  32.26  32.26  32.26  32.26  39.46  39.46
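The counts in Table 4 are of genes that carry at least one remaining variant in x or more patients. A minimal Python sketch of that tally is given here for illustration. The function names, patient labels, and gene names are hypothetical, and the sketch does not reproduce the table's numbers.

```python
from collections import Counter


def count_patients_per_gene(genes_by_patient):
    """Count, for each gene, how many patients carry a variant in it."""
    counts = Counter()
    for genes in genes_by_patient.values():
        counts.update(set(genes))  # a gene counts once per patient
    return counts


def genes_shared_by_at_least(genes_by_patient, min_patients):
    """Return the sorted genes hit in at least min_patients patients."""
    counts = count_patients_per_gene(genes_by_patient)
    return sorted(g for g, n in counts.items() if n >= min_patients)


# Toy input: genes with a remaining variant, listed per patient.
genes_by_patient = {
    "P1": ["GENE_A", "GENE_B", "GENE_A"],
    "P2": ["GENE_A", "GENE_C"],
    "P3": ["GENE_A", "GENE_B"],
}
for x in (3, 2, 1):
    print(x, genes_shared_by_at_least(genes_by_patient, x))
```

Lowering x can only add genes to the shared set, which is why the counts in each row of Table 4 grow from left to right.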

Table 5 shows the mean scaled CADD scores of variants in the ten highest-ranked genes that are shared by 5 or more patients.

Table 5. Mean CADD Scores of Candidate Disease Genes

Gene      Scaled CADD score

MLL2      39.46
HYDIN     32.26
CTBP2     29.27
MTCH2     26.00
DNAH11    25.50
CDC27     24.52
SYNE1     23.77
EPSTI1    22.60
CTDSP2    21.88
FRG1      19.16
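The phred-like scaling of CADD scores can be checked with a few lines of arithmetic. The sketch below assumes the relationship stated in this chapter, namely that a scaled score s corresponds to the top 10^(-s/10) fraction of variants, and applies it to the MLL2 and HYDIN means from Table 5. The function names are hypothetical.

```python
def top_fraction(scaled_score):
    """Fraction of variants ranked at or above a scaled CADD score."""
    return 10 ** (-scaled_score / 10)


def fold_difference(score_a, score_b):
    """How many times smaller the top fraction of score_a is than score_b."""
    return 10 ** ((score_a - score_b) / 10)


print(top_fraction(10))  # about 0.1, the top 10%
print(top_fraction(20))  # about 0.01, the top 1%

# Mean scaled scores of MLL2 and HYDIN taken from Table 5.
print(round(fold_difference(39.46, 32.26), 1))  # roughly 5.2
```

A difference of 7.2 points on this scale therefore corresponds to about a five-fold change in rank fraction, in line with the comparison between MLL2 and HYDIN made above.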

Specific annotations can be selected from each of the DataFrames just as simply. Table 6 displays the eight variants found in MLL2. Five patients each had a stop-gain mutation, in which a substitution changes a codon to a stop codon and prematurely terminates translation of the protein. One of those patients harbored two additional, less severe nonsynonymous mutations that may be additive toward pathogenicity. Another patient harbored one stop-gain mutation and one nonsynonymous mutation. The extremely high CADD scores of the MLL2 variants suggest that the gene may be causal of Kabuki syndrome. The minor allele frequency (MAF), extracted from ExAC, often correlates inversely with the deleteriousness of a variant. The most deleterious variants here are absent from the general population, which also supports these variants as the Kabuki syndrome variants. No indels were detected in MLL2. GATK HaplotypeCaller may have failed to identify insertion or deletion events in the gene, which potentially accounts for the remainder of the 7 patients reported by Ng et al to harbor disease mutations in MLL2.

Further investigation into HYDIN, CTBP2, MTCH2, and CDC27 is required to eliminate them from the list of candidate genes. As suggested by Ng, less stringent filtering may be required for more candidate genes to present themselves. The use of MyVariant.info allows us to reach the same conclusion in a more efficient and comprehensive manner. Data available through MyVariant.info encourages further analysis that could reveal the remaining genes causal of Kabuki syndrome.

Table 6. Variant Consequences by Patient

MLL2 Variant           CADD    Consequence      MAF       Patient ID

chr12:g.49420554C>T    57.00   Stop-Gained      0.000     04
chr12:g.49426460A>G    9.81    Nonsynonymous    0.0264    04
chr12:g.49448463C>T    22.1    Nonsynonymous    0.0258    04
chr12:g.49432651G>A    37.00   Stop-Gained      0.000     05
chr12:g.49435971G>A    18.31   Stop-Gained      0.000     06
chr12:g.49425575C>T    22.1    Nonsynonymous    0.00136   08
chr12:g.49435258G>A    41.00   Stop-Gained      0.000     08
chr12:g.49425791G>A    44.00   Stop-Gained      0.000     10

CHAPTER 4 DISCUSSION

As demonstrated in the use case for novel variant discovery in rare diseases, MyVariant.info provides annotations from a variety of population studies, functional impact prediction algorithms, and genomic, clinical, and phenotype resources to allow a comprehensive analysis of genetic variants. Because variants causal of rare monogenic disease are highly penetrant and damaging, candidate genes can be revealed rather seamlessly.

However, I also envision MyVariant.info as an indispensable tool for the analysis of complex diseases such as cancer.

Within the R or Python console, subsetting variants has proved an easy task. Whether studying complex or simple diseases, workflows align in principle. In more common diseases such as cancer, rather than filtering out common variants completely, allele frequency cutoffs can be selected. This operation can be performed with one line of code. MyVariant.info distinguishes itself in dynamic variant filtering as well. In the case of Kabuki syndrome, where filtering common variants proved to be too stringent, we can address the issue immediately by reapplying functions to saved data structures. Further, some databases like CADD provide whole-genome annotations, which allow us to select subsets of variants by functional consequence, such as only missense or nonsense. In complex diseases, gene expression may be altered by variants present in regulatory or other untranslated regions. These phenomena may be addressed by using annotations from CADD to subset and explore these regions.

The working version of MyVariant.info required a collaborative effort from experts in information technology as well as biology. From a bioinformatics standpoint, planned work includes the continual addition of annotation data. The entire Exome Aggregation Consortium dataset will be invaluable for use as control samples in whole exome

nonsynonymous variants from ExAC data are available through dbNSFP. Also, MyVariant.info only contains the 1000 Genomes Project data available through CADD and dbNSFP. Because the only annotations indexed are coding transcript annotations from CADD and nonsynonymous SNP data from dbNSFP, we are missing indels and non-coding or intergenic data from the 1000 Genomes Project.

Databases are continuously updated with new data and retractions, which becomes a burden on a researcher. Following the MyGene.info infrastructure, plans are in development to automate the process of checking our core set of databases for updates, downloading them, and indexing them into MyVariant.info.

Currently, a validation procedure is in place to ensure all SNPs indexed to MyVariant.info are mapped to the hg19 reference build correctly. Validation for insertion, deletion, duplication, and other variants is being developed to guarantee integrity of the data.

CONCLUSION

Software that delivers annotations from aggregated resources is not a novel concept, but MyVariant.info will distinguish itself from other services in several respects: (1) an open framework that allows for community contribution, (2) a programmatic interface that allows interactive exploration of annotations, and (3) the ability to query variant annotations given a wide variety of query terms. This infrastructure will prove to be an indispensable resource for performing comprehensive analyses for biomedical discovery.

REFERENCES

1. Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012).

2. Abecasis, G. R. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–73 (2010).

3. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).

4. Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using Hidden Markov Models. Hum. Mutat. 34, 57–65 (2013).

5. Schwarz, J. M., Rödelsperger, C., Schuelke, M. & Seelow, D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods 7, 575–576 (2010).

6. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants (Supplement). Nat. Genet. 46, 310–5 (2014).

7. McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070 (2010).

8. Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).

9. Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 39, (2011).

10. Garber, M. et al. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25, i54–i62 (2009).

11. Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).

12. Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).

13. Liu, X., Jian, X. & Boerwinkle, E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat. 32, 894–899 (2011).

14. Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–5 (2014).

15. Forbes, S. A. et al. COSMIC: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic Acids Res. 39, D945–50 (2011).

16. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

17. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

18. Kuhn, R. M., Haussler, D. & James Kent, W. The UCSC genome browser and associated tools. Brief. Bioinform. 14, 144–161 (2013).

19. Fujita, P. A. et al. The UCSC genome browser database: update 2011. Nucleic Acids Res. 39, D876-882 (2011).

20. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w 1118; iso-2; iso-3. Fly (Austin) 6, 80–92 (2012).

21. Rosenbloom, K. R. et al. ENCODE whole-genome data in the UCSC Genome Browser: update 2012. Nucleic Acids Res. 40, D912-917 (2012).

22. Wu, C., Mark, A. & Su, A. I. MyGene.info: Gene Annotation Query as a Service. (The Scripps Research Institute, 2014).

23. Horaitis, O. & Cotton, R. G. H. The challenge of documenting mutation across the genome: the human genome variation society approach. Hum. Mutat. 23, 447–452 (2004).

24. Hart, R. K. et al. A Python package for parsing, validating, mapping and formatting sequence variants using HGVS nomenclature. Bioinformatics 31, 268–270 (2015).

25. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).

26. Bioconductor. Download stats for Software package mygene http://bioconductor.org/packages/stats/bioc/mygene.html (2015).

27. Worthey, E. A. et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet. Med. 13, 255–62 (2011).

28. Ng, S. B. et al. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42, 30–5 (2010).

29. Ng, S. B. et al. Targeted capture and massively parallel sequencing of twelve human exomes. Nature 461, 272–276 (2009).

30. Ng, S. B. et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat. Genet. 42, 790–793 (2010).

31. Choi, M. et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc. Natl. Acad. Sci. U. S. A. 106, 19096–101 (2009).

32. Hoischen, A. et al. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nat. Genet. 42, 483–5 (2010).

33. Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 39, 1181–1186 (2007).

34. Tryka, K. A. et al. NCBI’s database of genotypes and phenotypes: DbGaP. Nucleic Acids Res. 42, D975-D979 (2014).

35. Van der Auwera, G. A. et al. From fastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr. Protoc. Bioinforma. 43, 1-11 (2013). doi:10.1002/0471250953.bi1110s43

36. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

37. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

38. McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for
