• No results found

Removing reference mapping biases using limited or no genotype data identifies allelic differences in protein binding at disease-associated loci

N/A
N/A
Protected

Academic year: 2020

Share "Removing reference mapping biases using limited or no genotype data identifies allelic differences in protein binding at disease-associated loci"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

Supplemental Materials

Removing reference mapping biases using limited or no genotype data identifies

allelic differences in protein binding at disease-associated loci

Martin L. Buchkovich, Karl Eklund, Qing Duan, Yun Li,Karen L. Mohlke and Terrence S. Furey

 

Supplemental Figure Legends

Figure S1. Reference mapping biases influence sequence alignment. When mapping reads

to a reference genome containing a single allele at heterozygous sites (left), sequence reads containing the reference allele (C, blue) are more likely to map correctly than reads containing the non-reference allele (A, red), especially in the presence of sequencing errors (orange). Reads containing each allele are equally as likely to map correctly when mapping reads to sequence containing both alleles at heterozygous sites (right).

Figure S2. False positive imbalance sites have more significant p-values. Boxplot of

P-values at sites of allelic imbalance using complete (left), and partial (middle) genotypes and common variants (right). For the complete genotype alignment, P-values are further subdivided by inclusion in partial genotypes and common variants. For partial genotypes and common variant alignments, all P-values are displayed in addition to being divided into inclusion during alignment and predicted from the alignment, and then further divided into whether or not the sites was predicted using complete genotypes (false positive vs true positive). Subdivided groups with significant differences in P-value distributions (Mann Whitney U test; P<.01 and P<.001) are indicated.

(2)

Figure S3. Read length influences sequence alignment and allelic imbalance detection

at heterozygous sites. (A) Alignment statistics for alignments of CREB1 ChIP-seq data, and

when using different (B) read lengths and (C) sequence depths, plotted as percent of the 50 bp statistics. (D) Histogram of the number of reads containing the underrepresented allele at sites with significant imbalance (binomial P<.01, uncorrected) with 2 or more reads containing each allele. Vertical dashed line indicates the minimum of 5 or more reads containing each allele required for imbalance detection in (A).

Figure S4. Correlation of alignment statistics and number of imbalances detected. The

number of sites of allelic imbalance in thirteen ChIP-seq and one DNase-seq dataset compared to the (A) total number of reads aligned; (B) number of reads aligned to heterozygous sites; (C) total number of bases aligned (read length x total reads aligned); the percent of genome with greater than (D) 1X and (E) 10X coverage; (F) the average read depth at bases with 1X or more coverage; (G) the ratio of sites with 10X coverage to 1X coverage and (H) the number of

heterozygous sites identified. Pearson correlation R2 values for each statistic and number of

allelic imbalances are displayed for all data and only ChIP-seq data.

Figure S5. Allelic differences in binding at sites without predicted allelic imbalance.

EMSA using purified CREB1 and labeled probes containing each allele at five sites with reads mapping but not significant allelic imbalance to test for allelic differences in binding. Alleles colored blue are the reference allele and alleles colored red are the other allele. Allelic

differences in protein binding were detected at two sites (starred) without predicted imbalance but with predicted allelic differences in the presence of CREB1 motif. Only CREB1-bound probe is shown. 

(3)

GGC

G

TCTG

C

AGCAATTA

GGCCTCTG

C

AGC

T

ATTA

GGCCTCTG

A

AGCAATTA

GGCCTCTG

C

AGCAATTA

G

A

CCTCTG

A

AGCAATTA

GGCCTCTG

A

AGCAA

G

TA

GGC

G

TCTG

C

AGCAATTA

GGCCTCTG

C

AGC

T

ATTA

GGCCTCTG

A

AGCAATTA

GGCCTCTG

C

AGCAATTA

G

A

CCTCTG

A

AGCAATTA

GGCCTCTG

A

AGCAA

G

TA

Alignment to single allele Alignment to both alleles

0 or 1 mismatches

Not Aligned

2 or more mismatches

GGCCTCTG

C

AGCAATTA

GGCCTCTG

C

AGCAATTA

GGCCTCTG

A

AGCAATTA

Reference

Sequence

Aligned

Heterozygous Site

C

/

A

Heterozygous Site

C

/

A

Figure S1

(4)

45 40 35 30 25 20 15 10 5 0

Complete TotalPartial GenotypesKnown Predicted Total Common VariantsMAF≥.05 MAF<.05

Total Partial Common

Allelic imbalance P -values (binomial P <.01) (-log( P )) ** ** * *

* Mann Whitney U P<.05 ** Mann Whitney U P<.01 *

All Variants Included Variants Variants not Included True Positives False Positives

(5)

Factor Condition Reads at Heterozygous Sites Allelic Imbalances CREB1 50 bp 33,599,679 1,295,901 200 35 bp 890,153 106 20 bp 27,981,208 454,580 26 Sampled 70% 23,518,477 907,345 123 Sampled 40% 13,442,901 518,519 45 CREB1 CREB1 CREB1 CREB1 32,368,677

Total Reads Aligned

Number of reads with underrepresented allele

Number of imbalanced sites

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 10 15 20 25 30 35 40 45 50 55 5 0 CREB1 50 bp reads CREB1 35 bp reads CREB1 20 bp reads Read Length

Percent of 50 bp Alignment Statistics

50 bp 35 bp 20 bp 0 10 20 30 40 50 60 70 80 90 100

Percent of Reads Aligned

Percent Alignment Statistics for All Reads

100% 70% 40% 0 10 20 30 40 50 60 70 80 90 100

Total Reads Aligned Reads at Heterozyogus Sites Expected Aligned Bases Allelic Imbalances

Total Reads Aligned Reads at Heterozyogus Sites Expected Aligned Bases Allelic Imbalances

A

B

C

D

(6)

20 40 60 80 100 120 140 160 Total Number of Reads Mapped

(millions of reads)

0.5 1.0 1.5 2.0 2.5 Reads Mapped at Heterozygous Sites

(millions of reads) 0 50 100 150 200 250 300

Sites of Allelic Imbalance

All R2 = 0.33

ChIP-seq R2 = 0.50

Total Number of Bases Mapped (billions of bases) 0 50 100 150 200 250 300

Sites of Allelic Imbalance

0 50 100 150 200 250 300

Sites of Allelic Imbalance

All R2 = 0.20 ChIP-seq R2 = 0.43 All R 2 = 0.37 ChIP-seq R2 = 0.51 0.5 1.0 1.5 2.0 2.5 3.0 15 20 25 30 35 40 45 Percent Genome Coverage

(1X coverage) 0 50 100 150 200 250 300

Sites of Allelic Imbalance

0 50 100 150 200 250 300

Sites of Allelic Imbalance

All R2 = 0.31

ChIP-seq R2 = 0.35 All R

2 = 0.59

ChIP-seq R2 = 0.90

Percent Genome Coverage (10X coverage) 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

A

B

C

D

E

F

G

H

Figure S4

ChIP-seq data

DNase-seq data

All data

Average Read Depth at Genomic Bases (mapped bases/1X genomic bases)

0 50 100 150 200 250 300

Sites of Allelic Imbalance

All R2 = 0.58 ChIP-seq R2 = 0.78 1.2 1.4 1.6 1.8 2.0 2.2 0 2000 4000 6000 8000 0 50 100 150 200 250 300

Sites of Allelic Imbalance

All R2 = 0.72

ChIP-seq R2 = 0.91

Percent genomic bases with >=10 reads (Genomic bases 10X/genomic bases 1X)

0 50 100 150 200 250 300 All R2 = 0.09 ChIP-seq R2 = 0.86 0.5 1.0 1.5 0.0

Sites of Allelic Imbalance

Heterozygous Sites (5 or more reads containing each allele)

(7)

Allele 1 Allele 2

Variant rsID Allele Reads Allele ReadsImbalance P-value

A rs72807213 G 24 A B rs9388486 T 36 C C rs274035 G 19 A D rs12145434 T 78 A E rs55811458 G A 17 33 17 82 53 0.35 0.81 0.87 0.81 0.30 E D C B* A* A G A T A G C T A G 42 Purified CREB1

Motif predicted for

reference allele only Motif predicted for both alleles

References

Related documents

Building on these ideas, we propose the Patient-Teacher- Tutorial model of outpatient medical student learning (Figure 1) where there are three sequential phases: the medical student

Ms temperature as a function of retained austenite size and C content and the applied tensile stress, for the most favorably oriented austenite feature and the chemical

The product was tested using a test plan approved by EPA, Test/QA Plan for Biological Inactivation Efficiency by HVAC In-Duct Ultraviolet Light Air Cleaners.. (1) The tests

The Copyright Act 1994 offers strong protection for industrial design, and in New Zealand designers have tended to rely more on copyright protection than on registering

(stating that the dispute settlement system seeks to preserve the WTO mem- bers' balance of rights and obligations by providing for uniform interpretation of

The effects of REER volatility of trading partners on Jordan’s trade flows was statistically insignificant for total external trade, ranging from (-0.06 to 0.06) for

Two groups have been formed to study anxiety for students who passed the swimming course (experimental group) and students who did not have any knowledge about

This study also conclude that educational status of mothers, residential area of mothers and antenatal steroids are plays an important role in the outcome of very