CHAPTER 3: ALLELIC IMBALANCE DETECTION IN QUANTITATIVE SEQUENCE
3.2.11 Experimental confirmation of allelic differences in protein binding and
Four of the imbalanced sites located at GWAS loci also have experimental evidence
demonstrating allelic differences in protein binding and/or enhancer activity (Figure 3.4A). At rs4969182 near PGS1, we detected enrichment of the A allele in data for 6 proteins including
FOXA1 and FOXA2. Experimentally, the A allele shows increased binding of FOXA1 and
FOXA2, as well as increased enhancer activity33. Likewise, at rs4846913, near
GALNT2, we predicted allelic imbalance in 5 datasets, and the enriched allele demonstrated
increased C/EBPβ binding and enhancer activity120. The enriched allele at
rs62102718, near PEPD, also has experimentally validated allelic differences in protein
Finally, we tested rs6813195 for allelic differences in enhancer activity in MIN6 mouse
insulinoma cells using a dual luciferase assay and observed increased enhancer activity for the
allele predicted to have increased binding of FOXA2 in human islets (Figure 3.4B). Together, these experimental data highlight the utility of using allelic imbalance detection to predict allelic
differences in protein binding and transcriptional activity at cardiometabolic phenotype-
associated loci.
3.3 Discussion
Allelic imbalance detection in quantitative sequence data is a powerful tool for
understanding genetic effects on the regulation of gene transcription. We used AA-ALIGNER to
detect allelic imbalance in ChIP-seq and DNase-seq data generated in cell lines and primary
cells from tissues playing a role in cardiometabolic phenotypes. Imbalance detection in these
samples has provided not only biological insights into the regulation of gene transcription at
specific cardiometabolic GWAS loci, but also more general insights into protein binding at
imbalanced sites.
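One common way to test a single heterozygous site for allelic imbalance is an exact binomial test on the reads carrying each allele. The sketch below is illustrative only: it is not the AA-ALIGNER procedure, and the function name, the null allelic fraction of 0.5, and the example read counts are assumptions made for this example. It makes no attempt to account for mapping bias.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allele read counts at one site.

    Hypothetical helper: the null hypothesis is that each read carries the
    reference allele with probability p_ref (0.5 for a balanced site).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    # Probability of every possible reference-read count under the null.
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are at least as unlikely as the observed count.
    # The small tolerance guards against floating-point ties.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Assumed example counts: 30 reads with one allele, 10 with the other.
p = binomial_two_sided_p(30, 10)
print(f"two-sided p-value: {p:.4g}")
```

This enumeration is only practical at modest read depths; with deep coverage, a library routine such as `scipy.stats.binomtest` would be the more sensible choice.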
We found evidence of allelic imbalance at hundreds of loci associated with
cardiometabolic traits and diseases. While these imbalanced sites may be located at GWAS loci
by chance, it is likely that many of them are playing an active role in regulating the transcription
of nearby genes and influencing the associated phenotype. For example, LD data suggest that
at two variants near GRB14, rs6713419 and rs10184004, the alleles predicted to have
increased MAFK binding are on the same haplotype as the alleles associated with both
increased triglyceride levels and type 2 diabetes risk. This effect is likely mediated by changes
in gene transcription, and GRB14, which binds to the insulin receptor and negatively regulates
insulin signaling122, is a strong candidate target. Differential protein binding could influence
GRB14 transcription and ultimately insulin signaling, although experimental validation is needed
activity have been experimentally observed, however, at three other sites with allelic imbalance,
and we are confident that future experimental testing will produce similar evidence for additional
imbalanced sites. As the GRB14 locus demonstrates, predicted imbalances can provide a
starting hypothesis for these experiments and expedite experimental exploration of gene
transcription regulation at GWAS loci.
In addition to GWAS loci, we also found allelic imbalance at sites associated with gene
expression. One variant with allelic imbalance, rs12091564, is associated with allelic differences
in NOTCH2NL transcription in islets as well as coronary artery disease risk. Notch signaling
plays a role in cardiovascular disease123, making it plausible that differential regulation of
NOTCH2NL by rs12091564 influences coronary artery disease risk. Two other imbalanced sites,
rs13356762 and rs185220, are associated with C5orf35 expression and T2D. This gene
encodes SETD9, and although it is unclear what role this protein might play in T2D risk, our
allelic imbalance results provide candidate variants to test for differences in regulatory activity.
We have additionally identified allelic imbalance at eQTLs outside of GWAS loci that may not be
immediately applicable in understanding the genetic effects on cardiometabolic phenotypes, but
could be important for understanding genetic effects on gene transcription in general.
In addition to providing candidate regulatory variants for experimental study, our
analyses have provided us with some insights into the mechanics of protein binding at sites of
allelic imbalance. First, we observed enrichment of reads containing the major allele at more
imbalanced sites than expected by chance, suggesting that variants promoting increased
protein binding may be evolutionarily favored, or conversely, variants disrupting binding
disfavored. Second, we have used allelic imbalance to perform a preliminary exploration of the
binding relationship of proteins co-localized to the same heterozygous site. We found evidence
of an association between the presence of imbalance in CTCF and cohesin subunits Rad21 and
localization in HepG2 cells124 and a direct interaction between CTCF and Rad21118. While our
analysis offers preliminary evidence of direct binding relationships between proteins, it is
important to note that it may be limited by many factors such as accuracy of binding motif
locations, sequencing depth, and ChIP-seq data availability.
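An association between imbalance calls for two co-localized proteins can be examined with a 2x2 contingency table, counting shared heterozygous sites by whether each protein is imbalanced. The sketch below uses Fisher's exact test as one reasonable choice; the source does not state which test was used, and the function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout for this illustration:
        a = sites imbalanced for both proteins
        b = sites imbalanced for protein 1 only
        c = sites imbalanced for protein 2 only
        d = sites imbalanced for neither
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def table_prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = table_prob(a)
    # Sum the tables that are at least as unlikely as the observed one.
    p_value = sum(
        table_prob(x)
        for x in range(lowest, highest + 1)
        if table_prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts of shared heterozygous sites, for illustration only.
print(f"p-value: {fisher_exact_two_sided(12, 3, 4, 21):.4g}")
```

A small p-value here would indicate only that imbalance calls for the two proteins co-occur more often than expected under independence; it would not by itself establish a direct binding relationship.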
Data availability is one of the greatest limiting factors of imbalance detection. Our
analyses were particularly limited by the small number of ChIP-seq samples generated in
pancreatic islets and adipose tissue. It is likely that in these tissues we were unable to detect
allelic imbalance at many sites influencing gene transcription at cardiometabolic GWAS loci. We
were limited further because allelic imbalance detection can only be done at heterozygous sites.
Even with the abundance of data from a liver cell line, we failed to detect allelic imbalance at
sites with documented allelic differences in protein binding in liver because these sites are
homozygous in HepG2 cells23,27,30. Despite this limitation, allelic imbalance detection is very
useful even in only a single dataset. Analyzing quantitative sequence data from more than one
individual would help to overcome this limitation, but analyzing large numbers of ChIP-seq
datasets in multiple individuals can be resource-prohibitive.
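The benefit of adding individuals can be made concrete with a short calculation. Assuming Hardy-Weinberg equilibrium and unrelated individuals (both assumptions of this sketch, not statements from the analysis above), a biallelic variant with minor allele frequency q is heterozygous in one individual with probability 2q(1 - q), so the chance that at least one of n individuals is heterozygous, and therefore testable, is 1 - (1 - 2q(1 - q))^n. The function name and the example frequencies are hypothetical.

```python
def prob_any_heterozygote(maf: float, n_individuals: int) -> float:
    """Probability that at least one of n individuals is heterozygous.

    Assumes Hardy-Weinberg equilibrium and unrelated individuals.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    het = 2 * maf * (1 - maf)  # heterozygosity for a single individual
    return 1 - (1 - het) ** n_individuals


# Illustrative frequencies and sample sizes, chosen for the example only.
for maf in (0.05, 0.2, 0.5):
    row = ", ".join(
        f"n={n}: {prob_any_heterozygote(maf, n):.2f}" for n in (1, 5, 10)
    )
    print(f"MAF {maf:.2f} -> {row}")
```

Under these assumptions, low-frequency variants gain the most from additional individuals, since a single sample is rarely heterozygous at them.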
DNase-seq data can identify the binding sites of many transcription factors in a single
assay and is an attractive option for identifying protein binding sites in a population of
individuals125. While we detected allelic imbalance in ChIP-seq data at a majority of sites
imbalanced in DNase-seq data, we predicted only a small fraction of sites imbalanced in ChIP-seq
data using DNase-seq data. DNase-seq has a more dispersed signal than most ChIP-seq
data and requires much greater sequencing depth to achieve the signal intensity that
ChIP-seq data reach with fewer reads. As signal intensity was highly correlated with imbalance
detection, it is likely that with greater sequencing depth DNase-seq data would be able to
identify a greater proportion of ChIP-seq imbalances. Protein binding to DNA creates a localized
footprints further limits imbalance detection in DNase-seq data at heterozygous sites directly
bound by protein. Additionally, the number of cells required to generate adequate sequencing
depth with DNase-seq can be prohibitive when using a limited number of primary cells. ATAC-seq,
an assay similar to DNase-seq, requires fewer cells and may reduce this limitation, but further study
is needed to assess the efficiency and accuracy of allelic imbalance detection in those data.
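The dependence of imbalance detection on read depth can be illustrated with a simple power calculation. The sketch below assumes an exact two-sided binomial test at a single heterozygous site, which is a stand-in and not necessarily the test used in these analyses; the function names, the true allelic fraction of 0.7, and the depths are all hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def power_at_depth(depth: int, true_ref_fraction: float, alpha: float = 0.05) -> float:
    """Power of an exact two-sided binomial test (null fraction 0.5).

    Returns the probability that a site with the given true allelic
    fraction is called imbalanced at significance level alpha when
    `depth` reads overlap the site.
    """
    null = [binom_pmf(k, depth, 0.5) for k in range(depth + 1)]
    power = 0.0
    for k in range(depth + 1):
        # Two-sided p-value for observing k reference reads under the null.
        p_value = sum(q for q in null if q <= null[k] * (1 + 1e-9))
        if p_value <= alpha:
            # k lies in the rejection region; add its probability
            # under the true allelic fraction.
            power += binom_pmf(k, depth, true_ref_fraction)
    return power


# Assumed scenario: 70% of reads truly carry the reference allele.
for depth in (10, 20, 50, 100):
    print(f"depth {depth:>3}: power = {power_at_depth(depth, 0.7):.2f}")
```

In this toy setting, power rises steeply with the number of reads overlapping the site, consistent with the expectation that a more dispersed signal needs deeper sequencing to reveal the same imbalance.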
We have limited our analyses to ChIP-seq and DNase-seq data generated in a single
liver cell line, pancreatic islet samples from 12 individuals, and two adipose cell lines. Additional
protein ChIP-seq and DNase-seq data exist for other liver and pancreatic cell lines as well as
primary cells and samples from these tissues. RNA-seq, FAIRE-seq, and histone
modification ChIP-seq data are also available for these and other samples related to
cardiometabolic phenotypes. As we expand allelic imbalance identification into these additional
data, we expect to find additional evidence of allelic imbalance at cardiometabolic phenotype-
associated loci and gain further insight into transcriptional activity at these loci.