CHAPTER 3: ALLELIC IMBALANCE DETECTION IN QUANTITATIVE SEQUENCE

3.2.11 Experimental confirmation of allelic differences in protein binding and

Four of the imbalanced sites located at GWAS loci also have experimental evidence

demonstrating allelic differences in protein binding and/or enhancer activity (Figure 3.4A). At rs4969182 near PGS1, we detected enrichment of the A allele in data for 6 proteins including

FOXA1 and FOXA2. Experimentally, FOXA1 and FOXA2 show increased binding to this same

allele, which also shows increased enhancer activity [33]. Likewise, at rs4846913, near

GALNT2, we predicted allelic imbalance in 5 datasets, and the enriched allele demonstrated

increased C/EBPβ binding and enhancer activity [120]. The enriched allele at

rs62102718, near PEPD, also has experimentally validated allelic differences in protein

Finally, we tested rs6813195 for allelic differences in enhancer activity in MIN6 mouse

insulinoma cells using a dual luciferase assay and observed increased enhancer activity for the

allele predicted to have increased binding of FOXA2 in human islets (Figure 3.4B). Together, these experimental data highlight the utility of allelic imbalance detection for predicting allelic

differences in protein binding and transcriptional activity at cardiometabolic phenotype-

associated loci.

3.3 Discussion

Allelic imbalance detection in quantitative sequence data is a powerful tool for

understanding genetic effects on the regulation of gene transcription. We used AA-ALIGNER to

detect allelic imbalance in ChIP-seq and DNase-seq data generated in cell lines and primary

cells from tissues playing a role in cardiometabolic phenotypes. Imbalance detection in these

samples has provided not only biological insights into the regulation of gene transcription at

specific cardiometabolic GWAS loci, but also more general insights into protein binding at

imbalanced sites.

We found evidence of allelic imbalance at hundreds of loci associated with

cardiometabolic traits and diseases. While these imbalanced sites may be located at GWAS loci

by chance, it is likely that many of them are playing an active role in regulating the transcription

of nearby genes and influencing the associated phenotype. For example, LD data suggest that

at two variants near GRB14, rs6713419 and rs10184004, the alleles predicted to have

increased MAFK binding are on the same haplotype as the alleles associated with both

increased triglyceride levels and type 2 diabetes risk. This effect is likely mediated by changes

in gene transcription, and GRB14, which binds to the insulin receptor and negatively regulates

insulin signaling [122], is a strong candidate target. Differential protein binding could influence

GRB14 transcription and ultimately insulin signaling, although experimental validation is needed

activity have been experimentally observed, however, at three other sites with allelic imbalance,

and we are confident that future experimental testing will produce similar evidence for additional

imbalanced sites. As the GRB14 locus demonstrates, predicted imbalances can provide a

starting hypothesis for these experiments and expedite experimental exploration of gene

transcription regulation at GWAS loci.
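
The haplotype reasoning used for the GRB14 variants can be illustrated with a small sketch. It assumes that phased haplotype frequencies for two biallelic variants are available; the function name `ld_stats` and the frequencies below are invented for illustration and are not taken from the GRB14 locus or from the analyses in this chapter. The sign of the disequilibrium coefficient D indicates which alleles tend to travel on the same haplotype, and r² indicates how strongly.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic variants.

    p_ab: frequency of the haplotype carrying allele A at variant 1
          and allele B at variant 2
    p_a:  frequency of allele A at variant 1
    p_b:  frequency of allele B at variant 2
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both variants must be polymorphic")
    return d, d * d / denom


# Hypothetical frequencies: A = allele with predicted increased binding,
# B = allele associated with the trait.
d, r2 = ld_stats(p_ab=0.28, p_a=0.30, p_b=0.30)
same_haplotype = d > 0  # positive D: A and B co-occur more often than expected
```

A positive D with a high r² would support the interpretation that the binding-increasing allele and the trait-associated allele mark the same haplotype; it does not by itself show that the imbalanced variant is causal.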

In addition to GWAS loci, we also found allelic imbalance at sites associated with gene

expression. One variant with allelic imbalance, rs12091564, is associated with allelic differences

in NOTCH2NL transcription in islets as well as coronary artery disease risk. Notch signaling

plays a role in cardiovascular disease [123], making it plausible that differential regulation of

NOTCH2NL by rs12091564 influences coronary artery disease risk. Two other imbalanced sites,

rs13356762 and rs185220, are associated with C5orf35 expression and T2D. This gene

encodes SETD9, and although it is unclear what role this protein might play in T2D risk, our

allelic imbalance results provide a candidate variant to test for differences in regulatory activity.

We have additionally identified allelic imbalance at eQTLs outside of GWAS loci that may not be

immediately applicable in understanding the genetic effects on cardiometabolic phenotypes, but

could be important for understanding genetic effects on gene transcription in general.

In addition to providing candidate regulatory variants for experimental study, our

analyses have provided us with some insights into the mechanics of protein binding at sites of

allelic imbalance. First, we observed enrichment of reads containing the major allele at more

imbalanced sites than expected by chance, suggesting that variants promoting increased

protein binding may be evolutionarily favored, or conversely, variants disrupting binding

disfavored. Second, we have used allelic imbalance to perform a preliminary exploration of the

binding relationship of proteins co-localized to the same heterozygous site. We found evidence

of an association between the presence of imbalance in CTCF and cohesin subunits Rad21 and

localization in HepG2 cells [124] and a direct interaction between CTCF and Rad21 [118]. While our

analysis offers preliminary evidence of direct binding relationships between proteins, it is

important to note that it may be limited by many factors such as accuracy of binding motif

locations, sequencing depth, and ChIP-seq data availability.
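
The first observation above, that the major allele is enriched at more imbalanced sites than expected by chance, can be framed as a one-sided sign test. The sketch below is an illustration only and is not necessarily the test used in this chapter. It assumes that, with no allele preference, the enriched allele would be the major allele at half of the imbalanced sites (`p_null = 0.5`), an assumption that ignores possible reference-mapping bias. The function names and the counts are hypothetical.

```python
from math import comb


def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def major_allele_enrichment_p(n_major_enriched, n_sites, p_null=0.5):
    """One-sided p-value that the major allele is the enriched allele
    at more imbalanced sites than expected under p_null."""
    return binom_sf(n_major_enriched, n_sites, p_null)


# Hypothetical counts: the major allele is enriched at 60 of 100 imbalanced sites.
p_value = major_allele_enrichment_p(60, 100)
```

A small p-value here would indicate only that the excess of major-allele enrichment is unlikely under the assumed null; choosing a different `p_null` (for example, one estimated from simulated reads) would be needed to account for technical bias.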

Data availability is one of the greatest limiting factors of imbalance detection. Our

analyses were particularly limited by the small number of ChIP-seq samples generated in

pancreatic islets and adipose tissue. It is likely that in these tissues we were unable to detect

allelic imbalance at many sites influencing gene transcription at cardiometabolic GWAS loci. We

were limited further because allelic imbalance detection can only be done at heterozygous sites.

Even with the abundance of data from a liver cell line, we failed to detect allelic imbalance at

sites with documented allelic differences in protein binding in liver because these sites are

homozygous in HepG2 cells [23,27,30]. Despite this limitation, allelic imbalance detection is very

useful even in only a single dataset. Analyzing quantitative sequence data from more than one

individual would help to overcome this limitation, but analyzing large numbers of ChIP-seq

datasets in multiple individuals can be resource-prohibitive.
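
To make the heterozygosity limitation concrete, the following sketch estimates how likely it is that at least one sampled individual is heterozygous at a variant, and therefore testable for allelic imbalance. It assumes Hardy-Weinberg proportions and unrelated individuals; the function name and the example allele frequency are illustrative assumptions, not values from this study.

```python
def prob_any_heterozygote(maf, n_individuals):
    """Probability that at least one of n unrelated individuals is
    heterozygous at a biallelic variant, assuming Hardy-Weinberg proportions."""
    if not 0 <= maf <= 0.5:
        raise ValueError("maf must be between 0 and 0.5")
    het = 2 * maf * (1 - maf)  # expected heterozygote frequency, 2pq
    return 1 - (1 - het) ** n_individuals


# Illustrative comparison for a variant with minor allele frequency 0.2:
single_sample = prob_any_heterozygote(0.2, 1)    # e.g. one cell line
twelve_samples = prob_any_heterozygote(0.2, 12)  # e.g. a 12-individual sample set
```

Under these assumptions, a single sample leaves most variants untestable, whereas even a modest number of individuals makes it likely that at least one carries both alleles.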

DNase-seq data can identify the binding sites of many transcription factors in a single

assay and is an attractive option for identifying protein binding sites in a population of

individuals [125]. While we detected allelic imbalance in ChIP-seq data at a majority of sites

imbalanced in DNase-seq data, we predicted only a small fraction of the sites imbalanced in ChIP-

seq data using DNase-seq data. DNase-seq has a more dispersed signal than most ChIP-seq

data and requires a much deeper sequencing depth to achieve the same signal intensity found

in ChIP-seq data with fewer reads. As signal intensity was highly correlated with imbalance

detection, it is likely that with greater sequencing depth DNase-seq data would be able to

identify a greater proportion of ChIP-seq imbalances. Protein binding to DNA creates a localized

footprints further limits imbalance detection in DNase-seq data at heterozygous sites directly

bound by protein. Additionally, the number of cells required to generate adequate sequencing

depth with DNase-seq can be prohibitive when only a limited number of primary cells is available. ATAC-

seq, which is similar to DNase-seq but requires fewer cells, may reduce this limitation, but further study

is needed to assess the efficiency and accuracy of allelic imbalance detection in ATAC-seq data.
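
The dependence of imbalance detection on read depth can be sketched with an exact binomial power calculation. This is a generic illustration, not the statistical procedure implemented in AA-ALIGNER: it assumes that reads at a heterozygous site are independent draws, that the null hypothesis is a 50:50 allelic ratio, and that a site is called imbalanced when the two-sided p-value is at or below `alpha`. All function names and parameter values are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p(k, n):
    """Exact two-sided binomial p-value against a 50:50 allelic ratio."""
    tail = min(k, n - k)
    return min(1.0, 2 * sum(binom_pmf(i, n, 0.5) for i in range(tail + 1)))


def detection_power(n_reads, true_ref_fraction, alpha=0.05):
    """Probability of calling imbalance at a site covered by n_reads reads
    when the true fraction of reference-allele reads is true_ref_fraction."""
    return sum(
        binom_pmf(k, n_reads, true_ref_fraction)
        for k in range(n_reads + 1)
        if two_sided_p(k, n_reads) <= alpha
    )


# Same underlying 70:30 imbalance observed at low and high coverage.
power_low_depth = detection_power(10, 0.7)
power_high_depth = detection_power(100, 0.7)
```

Under these assumptions the same true imbalance is detected far more often at higher coverage, which is consistent with the expectation that deeper DNase-seq sequencing would recover a greater proportion of the imbalances seen in ChIP-seq data.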

We have limited our analyses to ChIP-seq and DNase-seq data generated in a single

liver cell line, pancreatic islet samples from 12 individuals, and two adipose cell lines. Additional

protein ChIP-seq and DNase-seq data exist for other liver and pancreatic cell lines as well as

primary cells and samples from these tissues. RNA-seq, FAIRE-seq, and histone

modification ChIP-seq data are also available for these and other samples related to

cardiometabolic phenotypes. As we expand allelic imbalance identification into these additional

data, we expect to find additional evidence of allelic imbalance at cardiometabolic phenotype-

associated loci and gain further insight into transcriptional activity at these loci.
