Leveraging variant annotations for WGS data analysis

(1)

Stephanie Gogarten

(2)

Outline

• What are variant annotations?

• Why do we need annotations?

• Where can I find annotations?

• How do I use annotations in my analysis?

(3)

What is annotation?

• An annotation is qualitative or quantitative information about a variant or its position

• Annotations can be

    • variant dependent (dependent on chromosome, position, reference and alternate allele)

    • variant independent (dependent only on chromosome and position)
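
The distinction can be pictured as two lookup tables with different keys. The following is a minimal Python sketch, not taken from the slides: the table names, field names, position and values are all hypothetical and chosen only for illustration.

```python
# Hypothetical annotation tables; every value below is made up.
# Variant-dependent annotations are keyed by (chrom, pos, ref, alt).
variant_dependent = {
    ("1", 12345, "A", "G"): {"consequence": "missense_variant"},
    ("1", 12345, "A", "T"): {"consequence": "synonymous_variant"},
}

# Variant-independent annotations are keyed by (chrom, pos) only.
variant_independent = {
    ("1", 12345): {"conservation_score": 2.1},
}


def annotate(chrom, pos, ref, alt):
    """Merge position-level and allele-level annotations for one variant."""
    merged = {}
    merged.update(variant_independent.get((chrom, pos), {}))
    merged.update(variant_dependent.get((chrom, pos, ref, alt), {}))
    return merged
```

Two alternate alleles at the same position share the position-level score but can carry different allele-level consequences.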

(4)

Annotating variants = generating Google Maps for the genome

Seattle, WA, USA

Chr 22: 50552604

Name of the city provides some context about your location on Earth!

Chromosome and position provide some context about your location in the genome

(5)

Annotating variants = generating Google Maps for the genome

Chr 22: 50552604

SYCE3

Building outlines overlaid indicate you are in a building

Gene name annotations identify that the variant overlaps with the SYCE3 gene

(6)

Annotating variants = generating Google Maps for the genome

Chr 22: 50552604, rs131777

SYCE3

The rs identifier and GWAS Catalog annotations show that this variant was previously associated with the red blood cell trait “mean corpuscular volume”

Roads overlaid show paths you can take to go from point A to B

(7)

Annotating variants = generating Google Maps for the genome

Chr 22: 50552604

SYCE3

Road and building names give you more details about your location

Regulatory annotations help you identify that the variant overlaps with a regulatory element. The overlapping regulatory element is active in*:

• Red blood cells

• Platelets

It is not active in brain cells or bladder cells.

*hypothetical example

(8)

Maps evolve as we learn more!

(9)

Outline

• What are variant annotations?

• Why do we need annotations?

• Where can I find annotations?

(10)

Applications of annotations in rare variant association testing

1. Boost power in rare-variant association testing

    • Define genomic regions to group variants over

    • Filter variants within grouped variant sets

    • Use annotation as weights

2. Interpret signals identified in GWAS

    • Predict causal variants

    • Develop hypotheses about biological mechanisms explaining variant-trait associations

(11)

Aggregating variants boosts power

• The majority of variants from WGS studies are rare (MAF < 0.05)

• Single-variant analysis of rare variants lacks power due to

    • an increased multiple testing burden

    • a decrease in statistical power owing to the rarity of individuals carrying these variant alleles

• To gain power, rare variants are grouped using annotations
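
To make the grouping idea concrete, here is a small Python sketch using a burden-style collapse, which is one common way to aggregate rare variants; the slides do not prescribe a specific test, so treat this as an illustrative assumption. The genotype matrix is toy data and the function names are hypothetical.

```python
# Rows are individuals, columns are rare variants in one aggregation unit.
# Entries are alternate-allele counts (0, 1 or 2). Toy data, made up.
genotypes = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]


def carriers_per_variant(geno):
    """Number of individuals carrying each variant (one test per variant)."""
    n_variants = len(geno[0])
    return [sum(1 for row in geno if row[j] > 0) for j in range(n_variants)]


def burden_scores(geno):
    """Collapse each individual's genotypes into one rare-allele count."""
    return [sum(row) for row in geno]


per_variant = carriers_per_variant(genotypes)
burden = burden_scores(genotypes)
carriers_in_unit = sum(1 for b in burden if b > 0)
```

Each variant on its own has a single carrier, while the aggregated unit has four carriers and needs only one test instead of four.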

(12)

Applications of annotations in rare variant association testing

    • Filter variants within grouped variant sets

    • Use annotation as weights

(13)

Why filter variants within grouped variant sets?

• Annotations are used to filter or weight variants in association tests

    • Down-weight neutral/non-signal variants

    • Up-weight functional/deleterious signal variants

• A good filtering strategy increases the signal-to-noise ratio and increases power to detect an association
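
Filtering and weighting can be viewed as the same operation: a hard filter is a weight of 0 or 1. The Python sketch below illustrates this with a weighted burden-style score; the scores, the 0.5 cutoff and all names are hypothetical and not taken from the slides.

```python
# Toy annotation scores for three variants in one unit (values made up).
variants = [
    {"id": "v1", "score": 0.9},
    {"id": "v2", "score": 0.1},
    {"id": "v3", "score": 0.6},
]

# genotypes[i][j]: alternate-allele count for individual i at variant j.
genotypes = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]


def weighted_burden(geno, weights):
    """Per-individual sum of allele counts multiplied by variant weights."""
    return [sum(g * w for g, w in zip(row, weights)) for row in geno]


# Use the annotation itself as a continuous weight.
weights = [v["score"] for v in variants]
# A hard filter is the special case of 0/1 weights (hypothetical cutoff 0.5).
filter_mask = [1.0 if v["score"] >= 0.5 else 0.0 for v in variants]

weighted = weighted_burden(genotypes, weights)
filtered = weighted_burden(genotypes, filter_mask)
```

With the filter, the second individual (who carries only the low-scoring variant) contributes nothing to the unit's signal.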

(14)

Applications of annotations in rare variant association testing

    • Filter variants within grouped variant sets

    • Use annotation as weights

(15)

Outline

• What are variant annotations?

• Why do we need annotations?

• Where can I find annotations?

(16)

What is the source of annotations?

• Lots of resources!

Annotation can be generated by any lab or consortium

    • NCBI

    • Ensembl

    • UCSC

    • ENCyclopedia Of DNA Elements (ENCODE)

    • Roadmap Epigenomics Consortium

    • FANTOM5

    • dbSNP

    • …

(17)

Annotation software

• Bioconductor AnnotationHub

• ANNOVAR

• SnpEff

• VEP (Variant Effect Predictor)

• Many more...

Can’t decide? Try WGSA, which runs many different annotators and combines the results!

(18)

Whole Genome Sequence Annotator (WGSA)

• WGSA gathers annotations from various sources

• WGSA is provided both as

    • an Amazon Machine Image (AMI) ready to run out-of-the-box and

    • a downloadable version

• Licenses are required for non-academic usage for some of the resources

• Website: https://sites.google.com/site/jpopgen/wgsa/

(19)

WGSA has over 1,500 annotations

• Gene-based location and consequence

    • Software: SnpEff, ANNOVAR, VEP

    • Gene models: Ensembl, RefSeq, UCSC

• Transcript-specific annotation (transcript name, consequence, etc.)

• Loss-of-function annotations (e.g. LOFTEE)

• Deleteriousness predictions (CADD, MetaSVM, ssSNV, etc.)

• Allele frequencies (1000G, UK10K, ExAC, gnomAD, etc.)

• Regulatory annotations (ENCODE, Roadmap, FANTOM5)

• Conservation scores (GERP, etc.)

• Mappability scores

• rsIDs

• Many more …

(20)

Outline

• What are variant annotations?

• Why do we need annotations?

• Where can I find annotations?

(21)

Two steps involved in generating an aggregated variant list for association testing

STEP 1: Define aggregation units

• which genomic regions will be included in each unit

STEP 2: Decide on filtering criteria

• which variants will be filtered within each unit

(22)

STEP 1: Define aggregation units

(23)

Gene based aggregation units

• Gene and/or gene-related elements are the unit of aggregation

• Multiple gene models are available

    • GENCODE/Ensembl, RefSeq and UCSC

• Multiple releases of a given gene model are available for the same genome build

    • GENCODE v24, v26, etc. on the same genome build

• Gene models and gene-model-specific annotations may change between versions of a given gene model

    • Track the gene model and its specific version for reproducible analyses

(24)

Gene as the aggregation unit

• A gene is the contiguous genomic region spanning all the transcripts of that gene

Biotype     Unique Ensembl identifier   Genomic coordinates
Gene        ENSG00000179348             Chromosome 3: 128,479,427-128,493,185
Transcript  ENST00000341105             Chromosome 3: 128,479,427-128,493,185
Transcript  ENST0000043026              Chromosome 3: 128,480,146-128,487,916
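
The gene range in the table follows from the transcript ranges: it runs from the smallest transcript start to the largest transcript end. A minimal Python sketch of that rule, using the coordinates shown above (the function name is hypothetical and identifiers are copied as they appear in the slide):

```python
# (identifier, start, end) on chromosome 3, copied from the table above.
transcripts = [
    ("ENST00000341105", 128479427, 128493185),
    ("ENST0000043026", 128480146, 128487916),  # identifier as shown in the slide
]


def gene_span(transcripts):
    """Return the contiguous (start, end) range covering all transcripts."""
    start = min(t[1] for t in transcripts)
    end = max(t[2] for t in transcripts)
    return start, end


span = gene_span(transcripts)
```

Here the span equals the range listed for ENSG00000179348, because the first transcript already covers the second.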

(25)

Gene is regulated by non-coding regulatory elements

[Figure: non-coding regulatory elements of a gene, including an insulator. Modified from R. Searle and P. M. Hopkins, 2009]

(26)

Example gene-based aggregation units

• Gene

• Gene + flanking regions

• Gene + enhancer(s) + promoter

• UTRs + enhancer(s) + promoter

• Promoter of a gene

• First intron of a transcript

(27)

Other approaches for aggregating variants

Aggregation units are defined based on genomic positions and they can be:

1. Contiguous units of aggregation

    • Sliding window

    • Gene

    • Transcript

    • Exons

    • Introns

    • Regulatory regions (promoters, enhancers, etc.)

    • Topologically associated domains (TADs)

2. Non-contiguous units of aggregation

    • Gene/transcript + its associated regulatory regions

    • Domains of interacting proteins

    • Genes in a pathway

Any other biologically motivated unit of your choice …
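
Both kinds of unit reduce to sets of genomic intervals. The Python sketch below shows one way to build sliding windows (contiguous) and to assign variants to a unit made of several intervals (non-contiguous). Window size, step, coordinates and function names are hypothetical choices for illustration only.

```python
def sliding_windows(region_start, region_end, size, step):
    """Return 1-based inclusive (start, end) windows tiling a region."""
    windows = []
    start = region_start
    while start <= region_end:
        windows.append((start, min(start + size - 1, region_end)))
        start += step
    return windows


def variants_in_unit(positions, intervals):
    """Positions falling in any interval of a possibly non-contiguous unit."""
    return [p for p in positions
            if any(lo <= p <= hi for lo, hi in intervals)]


# Contiguous units: overlapping windows of 4 bp moved 2 bp at a time.
windows = sliding_windows(1, 10, size=4, step=2)

# Non-contiguous unit: a gene body plus a distal enhancer (made-up ranges).
gene_plus_enhancer = [(100, 200), (500, 550)]
hits = variants_in_unit([50, 150, 300, 520], gene_plus_enhancer)
```

A step smaller than the window size gives overlapping windows, so a variant can belong to more than one unit.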

(28)

STEP 2: Filtering aggregation units

(29)

Filtering variants in aggregation units

• A good filtering strategy increases the signal-to-noise ratio and increases power to detect an association

• Typically, one annotation or a combination of annotations is used for filtering

    • Example: within a gene, keep variants that

        a) Are frameshift mutations, or

        b) Have a Genomic Evolutionary Rate Profiling (GERP) score > 0, or

        c) Overlap with transcription factor binding sites in blood cells

• Quantitative annotations can be used as weights

• It is hard to predict the best filtering strategy

    • A very active area of research
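
The example filter above is a logical OR over three annotations. A minimal Python sketch, assuming each variant is a record with hypothetical fields `consequence`, `gerp` and `in_blood_tfbs` (these names and the toy values are not from the slides):

```python
def keep_variant(v):
    """Keep a variant if it meets any one of the three example criteria."""
    return (
        v["consequence"] == "frameshift_variant"  # a) frameshift
        or v["gerp"] > 0                          # b) GERP score > 0
        or v["in_blood_tfbs"]                     # c) overlaps a blood-cell TFBS
    )


# Toy variants within one gene (all values made up).
variants = [
    {"id": "v1", "consequence": "frameshift_variant", "gerp": -1.2, "in_blood_tfbs": False},
    {"id": "v2", "consequence": "intron_variant", "gerp": 3.4, "in_blood_tfbs": False},
    {"id": "v3", "consequence": "intergenic_variant", "gerp": -0.5, "in_blood_tfbs": True},
    {"id": "v4", "consequence": "synonymous_variant", "gerp": -2.0, "in_blood_tfbs": False},
]

kept = [v["id"] for v in variants if keep_variant(v)]
```

Only `v4`, which satisfies none of the criteria, is dropped.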

(30)

Example: filtering using functional consequence

• Genic unit

    Transcript range + 20 kb flanking region upstream and downstream

• Filters:

    FATHMM-MKL >= 0.5 and MAF <= 1%

Functional Analysis Through Hidden Markov Models - Multiple Kernel Learning (FATHMM-MKL)

• FATHMM-MKL generates scores predicting the functional consequences of both coding and non-coding sequence variants. It is a machine learning approach that integrates functional annotations from ENCODE with nucleotide-based sequence conservation measures.

• FATHMM-XF (FATHMM with eXtended Features) represents a substantial improvement over FATHMM-MKL
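
The genic unit and filters from this example can be sketched in a few lines of Python. The thresholds (20 kb flank, FATHMM-MKL >= 0.5, MAF <= 1%) come from the slide; the record fields, function names and toy coordinates are hypothetical.

```python
FLANK_BP = 20_000      # 20 kb upstream and downstream
MIN_FATHMM_MKL = 0.5   # keep variants scoring at or above this
MAX_MAF = 0.01         # keep variants with MAF <= 1%


def genic_unit(tx_start, tx_end, flank=FLANK_BP):
    """Extend a transcript range by the flank, not going below position 1."""
    return max(1, tx_start - flank), tx_end + flank


def passes_filters(variant, unit):
    """True if the variant lies in the unit and meets both thresholds."""
    lo, hi = unit
    return (lo <= variant["pos"] <= hi
            and variant["fathmm_mkl"] >= MIN_FATHMM_MKL
            and variant["maf"] <= MAX_MAF)


unit = genic_unit(100_000, 150_000)  # made-up transcript range
variants = [
    {"id": "v1", "pos": 85_000, "fathmm_mkl": 0.9, "maf": 0.002},   # in flank, passes
    {"id": "v2", "pos": 120_000, "fathmm_mkl": 0.3, "maf": 0.002},  # score too low
    {"id": "v3", "pos": 120_000, "fathmm_mkl": 0.7, "maf": 0.05},   # too common
    {"id": "v4", "pos": 175_000, "fathmm_mkl": 0.9, "maf": 0.001},  # outside unit
]
kept = [v["id"] for v in variants if passes_filters(v, unit)]
```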

(31)

Filtering variants in aggregation units

• Generate variant sets with one or more filtering approaches

• Evaluate the distributions of variants within each aggregation unit

• Can choose a less-, medium- and more-stringent filtered set
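
One way to compare filtered sets is to count how many variants remain in each aggregation unit under each cutoff. The Python sketch below does this for three stringency levels; the cutoffs, unit names and scores are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter

# Toy variants with an aggregation unit and a quantitative score (made up).
variants = [
    {"unit": "geneA", "score": 0.9},
    {"unit": "geneA", "score": 0.6},
    {"unit": "geneA", "score": 0.3},
    {"unit": "geneB", "score": 0.55},
    {"unit": "geneB", "score": 0.1},
]

# Hypothetical cutoffs for three levels of stringency.
cutoffs = {"less_stringent": 0.2, "medium": 0.5, "more_stringent": 0.8}


def variants_per_unit(variants, cutoff):
    """Count the variants in each unit that score at or above the cutoff."""
    return Counter(v["unit"] for v in variants if v["score"] >= cutoff)


summary = {name: variants_per_unit(variants, c) for name, c in cutoffs.items()}
```

Units that drop to zero variants under the most stringent set (here `geneB`) cannot be tested with that set, which is one reason to inspect these distributions before choosing a filter.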