
2. Distilled Sensing Applied to GWAS Data

2.2. Adaptive Sampling

2.2.1. Basic Principles

One of the key obstacles in GWAS at contemporary scale is that multiple testing correction presents a substantial burden that hinders the detection of weak-effect loci. This obstacle is inherent in the study of any high-dimensional data set, particularly when the number of features one could test (in this case, the number of independent SNPs) is orders of magnitude larger than the number of samples available for testing. Furthermore, discovering risk loci for human disease from GWAS data represents a sparse signal recovery problem, as the number of risk loci researchers expect to be associated with a particular trait is far smaller than the total feature space being searched. In other words, we expect that the vast majority of SNPs in the genome are not associated with any one specific trait or disease. Fortunately, the challenge of sparse signal recovery has been well studied in the field of image reconstruction, and by drawing upon the experience of information theorists, geneticists can make immediate gains in the detection of risk loci from GWAS data, as presented in this dissertation.
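To make the multiple testing burden concrete, the short sketch below computes a Bonferroni-style per-test threshold for two search-space sizes. The figure of 1,000,000 independent tests is the assumption behind the conventional genome-wide threshold of 5e-8; the reduced count of 100,000 is purely illustrative and is not a result from this work.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# Assumed number of independent tests in a genome-wide scan.
full_threshold = bonferroni_threshold(0.05, 1_000_000)

# Illustrative only: a search space reduced to one tenth of its size.
reduced_threshold = bonferroni_threshold(0.05, 100_000)

print(full_threshold, reduced_threshold)
```

A tenfold reduction in the number of tests relaxes the per-test threshold tenfold, which is the sense in which shrinking the search space can help weak-effect loci reach significance.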

Shental et al. (2010) achieved superior detection of rare alleles in large population sequencing experiments by applying compressed sensing (CS) strategies to sequencing pipelines (Shental, Amir, and Zuk). Prior to its applications in genetics, CS theory was successfully adapted to problems in magnetic resonance imaging, single-pixel cameras, and geophysics. The general idea behind CS is to leverage the sparseness of a vector of known length n to represent the vector using a new set of k linear measurements, where k << n (Montefusco, Lazzaro, and Papi). In many ways, this approach is similar to principal component analysis (PCA), which has been used to study population stratification in GWA studies (Freedman et al.); however, PCA techniques neither assume that signals are sparse nor attempt to leverage sparseness to resolve the signal.
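The idea of representing a sparse vector of length n with k << n linear measurements can be illustrated with a deliberately simple toy case. The sketch below assumes the vector has exactly one nonzero entry and uses a binary measurement matrix built from the bits of each index, plus one row of ones. It is not the recovery procedure used in the cited CS work (general CS recovery typically relies on techniques such as L1 minimization, which are not shown here), and all function names are hypothetical.

```python
import math

def measurement_matrix(n):
    """Build a (1 + ceil(log2 n)) x n binary matrix.

    Row 0 sums every entry; row j + 1 sums the entries whose index has bit j set.
    """
    bits = max(1, math.ceil(math.log2(n)))
    rows = [[1] * n]
    for j in range(bits):
        rows.append([(i >> j) & 1 for i in range(n)])
    return rows

def measure(matrix, x):
    """Take the k linear measurements y = A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in matrix]

def recover_one_sparse(y):
    """Recover (index, value) of the single nonzero entry of x from y."""
    value = y[0]
    index = sum(1 << j for j, m in enumerate(y[1:]) if m != 0)
    return index, value

n = 1024
x = [0.0] * n
x[613] = 2.5                      # the single nonzero entry (illustrative)

A = measurement_matrix(n)
y = measure(A, x)                 # k = 11 measurements instead of n = 1024
index, value = recover_one_sparse(y)
```

Here 11 measurements suffice to locate and quantify one nonzero entry among 1024, because the sparsity assumption rules out almost all candidate vectors in advance.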

2.2.2. The Promise of Distilled Sensing for GWAS

Distilled Sensing (DS) is a new technique first proposed by Haupt et al. for detecting signals in sparse data sets (Haupt, Castro, and Nowak). The general principle of this method is to iteratively remove areas with weak evidence for an association signal, thereby reducing the search space for later stages, in which one attempts to resolve the signal using an increasingly sensitive screen. This approach has previously been applied successfully in the fields of deep space photography and image resolution (Haupt, Castro, and Nowak). Furthermore, DS has been shown to be a superior resolution technique compared to standard meta-analysis approaches for image recovery when detecting the real underlying signal within a series of noisy images (Haupt, Castro, and Nowak).

DS resolution strategies have three distinct stages. First is a focusing step, in which a relatively low-resolution screen is applied to the existing search space in an effort to broadly characterize the amount of signal in various local domains. In a subsequent trimming step, regions from the focusing step that lack sufficient evidence for signal are excluded from further consideration. Finally, after any number of rounds of focusing and trimming, there is a sensing step, which analyzes the surviving feature space with a scan whose resolving power is at least as high as (and typically higher than) that of the previous focusing steps in order to determine the location and magnitude of the signal.

Notably, DS emphasizes identifying the location of a signal at the expense of characterizing the signal’s magnitude, which is estimated in the sensing step using only a portion of the data. In essence, the DS method prioritizes detection of the coordinates of a sparse signal over precise estimation of its strength. This is consistent with the GWAS approach, as replication is essential both for validation and for improving estimates of the effect size.
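The focusing, trimming, and sensing stages can be sketched on a simulated sparse signal. This is a minimal illustration under stated assumptions, not the procedure of Haupt et al. as published: the signal is assumed non-negative, each round spreads a fixed hypothetical measurement budget over the entries still retained (so noise falls as the search space shrinks), and trimming keeps only entries whose noisy observation is positive. The amplitude, the number of rounds, and the final detection threshold are all illustrative choices.

```python
import random

def distilled_sensing(signal, rounds=3, threshold=2.5, seed=0):
    """Toy focus/trim/sense loop on a sparse, non-negative signal."""
    rng = random.Random(seed)
    budget = len(signal)               # hypothetical measurement budget per round
    kept = list(range(len(signal)))
    for _ in range(rounds):
        # Focusing: the budget is shared by the retained entries, so the
        # noise standard deviation shrinks as entries are discarded.
        sd = (len(kept) / budget) ** 0.5
        observed = {i: signal[i] + rng.gauss(0.0, sd) for i in kept}
        # Trimming: drop entries that show no positive evidence of signal.
        kept = [i for i in kept if observed[i] > 0.0]
    # Sensing: a final, more precise measurement of the survivors only.
    sd = (len(kept) / budget) ** 0.5
    final = {i: signal[i] + rng.gauss(0.0, sd) for i in kept}
    detected = sorted(i for i in kept if final[i] > threshold)
    return kept, detected

n = 10_000
true_idx = list(range(0, n, 1000))     # 10 hypothetical signal locations
signal = [0.0] * n
for i in true_idx:
    signal[i] = 5.0                    # illustrative amplitude

kept, detected = distilled_sensing(signal)
print(len(kept), detected)
```

Each trimming round removes roughly half of the null entries while almost never discarding a true signal of this size, so the sensing step operates on a much smaller search space with correspondingly less noise per entry.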

However, the potential for DS to increase discovery in GWAS data depends on a methodology for applying each stage of DS analysis to these datasets. For this purpose, we developed the GRIDS method (Genome Reduction by Iterative Distilled Sensing), which uses a Likelihood Ratio Test (LRT) to test all SNPs in a defined LD block simultaneously in the focusing step and then applies a quantile-based threshold to these regional scores in the trimming step. Finally, GRIDS uses a standard single-SNP regression analysis in the sensing step to test the markers in the regions that survive the trimming step (see Figure 2). Our results using both simulated and real data suggest that this approach improves the ability to detect weak-effect loci over standard meta-analysis techniques in many situations.

Figure 2: GRIDS Pseudo-Code

Genome  Reduction  by  Distilled  Sensing  (GRIDS)

Guiding Principle: Given a collection of noisy data, it is easier to determine where the signal ISN’T than to find where it IS.

Procedure:

Step 1: Determine the proportion of data to be budgeted for the focusing and sensing stages of the analysis.

Step 2: Group clusters of SNPs into regional LD blocks determined by the appropriate HapMap reference panel.

Step 3: Apply a low-resolution Likelihood Ratio Test (LRT) to SNPs within LD blocks to determine which regions show little evidence of harboring SNPs associated with risk of disease.

Step 4: Remove regions with the least evidence of association from the search space.

Step 5: Apply a finer-resolution SNP-based association test to the remaining data, testing only those SNPs in regions surviving data trimming in Step 4.

Step 6: Determine which loci show significant evidence for association with disease after correcting for the appropriate multiple testing burden in the trimmed search space.
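The flow of the procedure in Figure 2 can be sketched in a few lines. This is a hedged illustration rather than the GRIDS implementation: it assumes per-SNP z-scores have already been computed separately on the focusing and sensing portions of the data (Step 1), it substitutes the mean squared z-score of a block for the block-level LRT (Step 3), and it uses a Bonferroni correction over surviving SNPs (Step 6). The SNP identifiers, block names, z-scores, and function names are all hypothetical.

```python
import math
from statistics import NormalDist

def block_scores(focus_z, blocks):
    """Step 3 stand-in: mean squared z-score per LD block (not a true LRT)."""
    return {b: sum(focus_z[s] ** 2 for s in snps) / len(snps)
            for b, snps in blocks.items()}

def trim_blocks(scores, keep_fraction):
    """Step 4: keep the top-scoring fraction of blocks (quantile threshold)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return set(ranked[:n_keep])

def sense(sense_z, blocks, kept_blocks, alpha=0.05):
    """Steps 5-6: single-SNP tests in surviving blocks, Bonferroni-corrected."""
    surviving = [s for b in kept_blocks for s in blocks[b]]
    threshold = alpha / len(surviving)
    norm = NormalDist()
    hits = [s for s in surviving
            if 2.0 * (1.0 - norm.cdf(abs(sense_z[s]))) < threshold]
    return sorted(hits), threshold

# Step 2: hypothetical LD blocks, each holding two SNPs.
blocks = {"A": ["rs1", "rs2"], "B": ["rs3", "rs4"],
          "C": ["rs5", "rs6"], "D": ["rs7", "rs8"]}
# Step 1: z-scores from disjoint focusing and sensing portions of the samples.
focus_z = {"rs1": 0.2, "rs2": -0.5, "rs3": 2.1, "rs4": 1.8,
           "rs5": 0.1, "rs6": 0.3, "rs7": -1.0, "rs8": 0.9}
sense_z = {"rs1": 0.4, "rs2": 0.1, "rs3": 2.6, "rs4": 1.0,
           "rs5": -0.3, "rs6": 0.2, "rs7": 0.5, "rs8": -0.2}

scores = block_scores(focus_z, blocks)
kept_blocks = trim_blocks(scores, keep_fraction=0.5)
hits, threshold = sense(sense_z, blocks, kept_blocks)

# For comparison: the same sensing test without any trimming.
all_hits, all_threshold = sense(sense_z, blocks, set(blocks))
```

In this toy data the hypothetical SNP rs3 passes the corrected threshold only when half of the blocks have been trimmed away, since the threshold is then divided by four tests rather than eight. Whether trimming helps in practice depends on how reliably the focusing step retains the truly associated regions.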
