• No results found

Leveraging variant annotations for WGS data analysis

N/A
N/A
Protected

Academic year: 2021

Share "Leveraging variant annotations for WGS data analysis"

Copied!
31
0
0

Loading.... (view fulltext now)

Full text

(1)

Leveraging variant annotations for WGS

data analysis

Stephanie Gogarten

1

(2)

Outline

• What are variant annotations?

• Why do we need annotations? • Where can I find annotations?

• How do I use annotations in my analysis?

(3)

What is annotation?

• Annotation is a qualitative or quantitative information about the variant or its position

• Annotations can be

variant dependent (dependent on chrom, position, reference and alternate allele)

variant independent (dependent only on chromosome and position)

3

chromosome position Reference

(4)

Annotating variants = generating google maps for genome

4

Seattle, WA, USA

Chr 22: 50552604

Name of the city provides some context about your

location on Earth! Chromosome and position provides some context about your location in the genome

(5)

Annotating variants = generating google maps for genome

5

Chr 22: 50552604

SYCE3

Building outlines overlaid indicate you are in a

building Gene name annotations identify that the variant overlaps with SYCE3 gene

Chr 22

(6)

Annotating variants = generating google maps for genome

6

Chr 22: 50552604, rs131777

SYCE3

rs identifier and GWAS catalogue annotations help you identify that this variant is previously associated with red cell trait “Mean corpuscular volume”

Roads overlaid show paths you can take to go from point A to B

Chr 22

(7)

Annotating variants = generating google maps for genome

7

Chr 22: 50552604

SYCE3

Regulatory annotations help you identify that the variant

overlaps with a regulatory element. The overlapping regulatory element is active in* :

• Red cell cells • Platelets

• Not in brain cells and bladder cells Road and building names give you more details

about your location

*hypothetical example

Chr 22

(8)

Maps evolve as we learn more!

8

(9)

Outline

• What are variant annotations?

• Why do we need annotations?

• Where can I find annotations?

• How do I use annotations in my analysis?

(10)

Applications of annotations in rare variant association testing

1. Boost power in rare-variant association testing

Define genomics regions to group variants over

Filter variants within grouped variant sets

Use annotation as weights

2. Interpret signals identified in GWAS

Predict causal variants

Develop hypotheses about biological mechanisms explaining variant-trait associations

(11)

Aggregating variants boosts power

• Majority of variants from WGS studies are rare (MAF<0.05)

• Single variant analysis of rare variants lack power due to

an increased multiple testing burden

a decrease in statistical power owing to the

rarity of individuals carrying these variant alleles

• To gain power, rare variants are grouped using annotations

11

(12)

Applications of annotations in rare variant association testing

1. Boost power in rare-variant association testing

Define genomics regions to group variants over

Filter variants within grouped variant setsUse annotation as weights

2. Interpret signals identified in GWAS

Predict causal variants

Develop hypotheses about biological mechanisms explaining variant-trait associations

(13)

Why filter variants within grouped variant sets?

• Annotations are used to filter or weight variants in association tests

Down weight neutral/non-signal variantsUp-weight functional/deleterious signal

variants

Good filtering strategy increases the

signal to noise ratio & increases power to detect an association

(14)

Applications of annotations in rare variant association testing

1. Boost power in rare-variant association testing

Define genomics regions to group variants over

Filter variants within grouped variant sets

Use annotation as weights

2. Interpret signals identified in GWAS

Predict causal variants

Develop hypotheses about biological mechanisms explaining variant-trait associations

14

(15)

Outline

• What are variant annotations? • Why do we need annotations?

• Where can I find annotations?

• How do I use annotations in my analysis?

(16)

What is the source of annotations?

• Lots of resources!

Annotation can be generated by any lab or consortium

NCBI

EnsembleUCSC

ENCylopedia Of DNA Elements (ENCODE)Roadmap Epigenomics Consortium

FANTOM5dbSNP

(17)

Annotation software

• Bioconductor AnnotationHub

• ANNOVAR

• SnpEff

• VEP (Variant Effect Predictor)

• Many more...

Can’t decide? Try WGSA, which runs many different annotaters and combines the results!

(18)

Whole Genome Sequence Annotator (WGSA)

• WGSA gathers annotations from various sources • WGSA is provided both as

an Amazon Machine Image (AMI) ready to run out-of-the-box anda downloadable version

• Licenses are required for non-academic usage for some of the resources • Website: https://sites.google.com/site/jpopgen/wgsa/

(19)

WGSA has over 1,500 annotations

Gene based location and consequenceSoftwares : SnpEff, ANNOVAR, VEPGene models: Ensembl ,RefSeq ,UCSC

Transcript-specific annotation (transcript name, consequence etc.)Loss-of-function annotations (eg: LOFTEE)

Deleteriousness predictions (CADD, MetaSVM, ssSNV etc)Allele frequencies (1000G, UK10K, EXAC, gnomAD etc)Regulatory annotations (ENCODE, Roadmap, FANTOM5)Conservation scores (GERP etc)

Mappability scoresrsIDs

Many more ….

(20)

Outline

• Why do we need annotations? • What are variant annotations? • Where can I find annotations?

• How do I use annotations in my analysis?

(21)

Two steps involved in generating aggregated variant list for

association testing

21

STEP 1: Define aggregation units

• which genomic regions will be included in each unit STEP 2: Decide on filtering criteria

• which variants will be filtered within each unit

(22)

STEP 1: Define aggregation units

22

(23)

Gene based aggregation units

• Gene and/or gene related elements are the unit of aggregation • Multiple gene models available

GENCODE/Ensembl, RefSeq and UCSC

• Multiple releases of a given gene model are available for same genome build

GENCODE v24,v26 etc. on same genome build

• Gene models and gene model specific annotations may change a given gene model version

Track gene model and its specific version for reproducible analyses

(24)

Gene as the aggregation unit

• Gene is the contiguous genomic region spanning all the transcripts of a gene

24

Biotype Unique Ensembl identifier Genomic coordinates

Gene ENSG00000179348 Chromosome 3: 128,479,427-128,493,185

Transcript ENST00000341105 Chromosome 3: 128,479,427-128,493,185

Transcript ENST0000043026 Chromosome 3: 128,480,146-128,487,916

(25)

Gene is regulated by non-coding regulatory elements

Modified from R. Searle and P. M. Hopkins, 2009

Insulator

25

(26)

Example gene-based aggregation units

Gene

Gene + flanking regions

Gene + enhancer(s) + promoterUTR’s + enhancer(s) + promoterPromoter of a gene

First intron of a transcript

(27)

Other approaches for aggregating variants

Aggregation units are defined based on genomic positions and they can be: 1. Contiguous units of aggregation

Sliding windowGene

TranscriptExons

Introns

Regulatory regions (promoters, enhancers, etc.)Topologically associated domains (TAD’s)

2. Non-contiguous units of aggregation

Gene/Transcript + its associated regulatory regionsDomains of interacting proteins

Genes in a pathway

Any other biologically motivated unit of your choice …

(28)

STEP 2: Filtering aggregation units

(29)

Filtering variants in aggregation units

• Good filtering strategy increases the signal to noise ratio & increases power to detect an association

• Typically, one or a combination of annotations are used for filtering

Example: Within a gene keep variants that

a) Are frameshift mutations or

b) Overlap Genomic Evolutionary Rate Profiling (GERP) score > 0 or c) Overlap with transcript factor binding sites in blood cells

• Quantitative annotations can be used as weights • It is hard to predict the best filtering strategy

• A very active area of research

(30)

Example: filtering using functional consequence

• Genic unit

Transcript range + 20 kb flanking region upstream and downstream • Filters:

FATHMM-MKL>=0.5 and MAF<=1%

Functional Analysis Through Hidden Markov Models -Multiple Kernel Learning (FATHMM-MKL)

• FATHMM-MKL generates scores predicting functional consequences of both coding and non-coding sequence. FATHMM-MKL is a machine learning approach that integrates functional annotations from ENCODE with nucleotide based sequence conservation measures variants. • FATHMM-XF (FATHMM with eXtended Features) represents a substantial improvement over

FATHMM-MKL

30

(31)

Filtering variants in aggregation units

• Generate variant sets with one or more filtering approaches

• Evaluate the distributions of variants within each aggregation unit

• Can choose a less-, medium- and more-stringent filtered set

References

Related documents

Dynamic behaviors of a non selective harvesting single species stage structured system incorporating partial closure for the populations Xiao and Lei Advances in Difference Equations

Zhao, Heat transfer performance of a horizontal micro grooved heat pipe using.

In this study, an algorithm is proposed to estimate the mechanical impedance (MI) between an actuator of the CPR machine and the chest of the patient, and to estimate the

Both the new trends in retail and commerce organization and the technological innovation in supply chain management and distribution planning have led decision makers to

Conclusion This study concludes that MASI score was significantly reduced in hydroquinone group but cases in topical tranexamic acid group developed less side effects..

It is shown that the anatomical structure o f the leaves of th e new Pentameris species in particular, both transverse sections and abaxial epidermal scrapes, resembles closely that

Figure 2 (top) shows, as a function of the number of worker nodes, the 50th, 90th, and 99th percentiles of the job completion time distribution for short jobs for Eagle, normalized

In this dissertation, we propose a low-cost multi-failure resilient replication scheme to ensure high data availability while reducing the cost (i.e., consistency maintenance cost