Adaptive Sampling - Distilled Sensing Applied to GWAS Data

2. Distilled Sensing Applied to GWAS Data

2.2. Adaptive Sampling

2.2.1. Basic Principles

One of the key obstacles in GWAS at contemporary scale is that multiple testing correction presents a heavy burden that hinders the detection of weak-effect loci. This obstacle is inherent in the study of any high-dimensional data set, particularly when the number of features one could test (in this case, the number of independent SNPs) is orders of magnitude larger than the number of samples available for testing. Furthermore, discovering risk loci for human disease from GWAS data is a sparse signal recovery problem, as the number of risk loci researchers expect to be associated with a particular trait is much smaller than the total feature space being searched. In other words, we expect that the vast majority of SNPs in the genome are not associated with any one specific trait or disease. Fortunately, the challenge of sparse signal recovery has been well studied in the field of image reconstruction, and by drawing on the experience of information theorists, geneticists can make immediate gains in the detection of risk loci from GWAS data, as presented in this dissertation.
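To make the multiple testing burden concrete, the following minimal sketch computes a per-test significance threshold. It assumes a Bonferroni correction, which the text above does not name, and the test counts are hypothetical round numbers chosen only for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that holds the family-wise error rate at alpha.

    Bonferroni is an assumption made for this sketch; the source text only
    refers to "multiple testing correction" in general.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Hypothetical counts of independent SNPs, for illustration only.
full_search_space = bonferroni_threshold(0.05, 1_000_000)   # roughly 5e-8
reduced_search_space = bonferroni_threshold(0.05, 100_000)  # roughly 5e-7

# A smaller search space allows a less stringent per-test threshold.
print(full_search_space, reduced_search_space)
```

Under these assumptions, removing 90% of the search space relaxes the per-test threshold tenfold, which is the kind of gain a search-space reduction strategy aims for.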

Shental et al. (2010) achieved superior detection of rare alleles in large population sequencing experiments by applying compressed sensing (CS) strategies to sequencing pipelines (Shental, Amir, and Zuk). Prior to its applications in genetics, CS theory was successfully adapted to problems in magnetic resonance imaging, single-pixel cameras, and geophysics. The general idea behind CS is to leverage the sparseness of a vector of known length (n) to represent the vector using a new set of k linear measurements, where k << n (Montefusco, Lazzaro, and Papi). In some respects this approach resembles principal component analysis (PCA), which has been used to study population stratification in GWA studies (Freedman et al.); however, PCA techniques neither assume that signals are sparse nor attempt to leverage sparseness to resolve the signal.
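The measurement model described above can be written compactly as follows. This is the standard textbook formulation of CS rather than anything specific to the cited works; the symbols x (the sparse vector), y (the measurements), \Phi (the measurement matrix), and s (the number of nonzero entries) are notation introduced here, and the l1-minimization shown is one common recovery strategy among several.

```latex
\[
  y = \Phi x, \qquad
  x \in \mathbb{R}^{n},\ \|x\|_{0} \le s, \qquad
  \Phi \in \mathbb{R}^{k \times n},\ k \ll n
\]
\[
  \hat{x} = \arg\min_{z \in \mathbb{R}^{n}} \|z\|_{1}
  \quad \text{subject to} \quad \Phi z = y
\]
```

The first line states that only k linear combinations of the n unknowns are observed; the second recovers x by searching for the sparsest-looking vector (in the l1 sense) that is consistent with those observations.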

2.2.2. The Promise of Distilled Sensing for GWAS

Distilled Sensing (DS) is a new technique first proposed by Haupt et al. for detecting signals in sparse data sets (Haupt, Castro, and Nowak). The general principle of the method is to iteratively remove areas with weak evidence for an association signal, thereby reducing the search space in later stages, where one attempts to resolve the signal using an increasingly sensitive screen. This approach has previously been applied successfully in the fields of deep space photography and image resolution (Haupt, Castro, and Nowak). Furthermore, DS has been shown to be superior to standard meta-analysis approaches for image recovery when detecting the real underlying signal within a series of noisy images (Haupt, Castro, and Nowak). DS resolution strategies have three distinct stages. First is a focusing step, in which a relatively low-resolution screen is applied to the existing search space to broadly characterize the amount of signal in various local domains. In a subsequent trimming step, regions that lacked sufficient evidence for signal in the focusing step are excluded from further consideration. Finally, after any number of rounds of focusing and trimming, a sensing step analyzes the surviving feature space with a scan whose resolving power is at least as high as (and typically higher than) that of the preceding focusing steps, in order to determine the location and magnitude of the signal. Notably, DS emphasizes identifying the location of a signal at the expense of characterizing the signal’s magnitude, which is estimated in the sensing step using only a portion of the data. In essence, the DS method prioritizes detection of the coordinates of a sparse signal over precise estimation of its strength. This priority is consistent with the GWAS approach, in which replication is essential both for validation and for improving estimates of the effect size.
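The focusing, trimming, and sensing stages described above can be sketched on a toy one-dimensional signal. This is a simplified illustration under stated assumptions, not the procedure of Haupt et al. as published: the function name `distilled_sensing`, the equal measurement budget per round, the unit-variance Gaussian noise, and the keep-if-positive trimming rule (which presumes a nonnegative signal) are all choices made here for illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple


def distilled_sensing(
    signal: List[float],
    n_focus_rounds: int = 3,
    budget_per_round: Optional[float] = None,
    seed: int = 0,
) -> Tuple[List[int], Dict[int, float]]:
    """Toy distilled sensing on a sparse, nonnegative signal (hypothetical helper).

    Returns the surviving indices and a noisy amplitude estimate for each.
    """
    rng = random.Random(seed)
    # Assumption: every round gets the same total budget, split evenly
    # over the locations that are still under consideration.
    budget = float(len(signal)) if budget_per_round is None else budget_per_round
    kept = list(range(len(signal)))

    for _ in range(n_focus_rounds):
        if not kept:
            break
        scale = math.sqrt(budget / len(kept))
        # Focusing: one noisy, relatively coarse look at each retained location.
        observed = {i: scale * signal[i] + rng.gauss(0.0, 1.0) for i in kept}
        # Trimming: discard locations with no positive evidence. Pure-noise
        # locations are dropped about half the time in each round.
        kept = [i for i in kept if observed[i] > 0.0]

    # Sensing: a final look at the survivors. The same budget is spread over
    # fewer locations, so each measurement is more precise than before.
    estimates: Dict[int, float] = {}
    if kept:
        scale = math.sqrt(budget / len(kept))
        for i in kept:
            estimates[i] = (scale * signal[i] + rng.gauss(0.0, 1.0)) / scale
    return kept, estimates


# Illustrative usage: 2000 locations, three of which carry signal.
truth = [0.0] * 2000
for position in (10, 500, 1500):
    truth[position] = 6.0
survivors, amplitude_estimates = distilled_sensing(truth)
print(len(survivors), all(p in survivors for p in (10, 500, 1500)))
```

Because the budget is fixed while the retained set shrinks, the per-location precision grows from round to round, which is how later stages gain sensitivity. The amplitude estimates come only from the final sensing measurement, mirroring the point above that magnitude is estimated from a portion of the data.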

However, the potential for DS to increase discovery in GWAS data depends on a methodology for implementing each stage of DS analysis on these datasets. For this purpose, we developed the GRIDS method (Genome Reduction by Iterative Distilled Sensing), which uses a Likelihood Ratio Test (LRT) to simultaneously test all SNPs in a defined LD block in the focusing step and then applies a quantile-based threshold to these regional scores in the trimming step. Finally, GRIDS uses a standard single-SNP regression analysis in the sensing step to test the markers in the regions that survive the trimming step (see Figure 2). Our results using both simulated and real data suggest that this approach improves the ability to detect weak-effect loci over standard meta-analysis techniques in many situations.

Figure 2: GRIDS Pseudo-Code

Genome Reduction by Iterative Distilled Sensing (GRIDS)

Guiding Principle: Given a collection of noisy data, it is easier to determine where the signal ISN’T than to find where it IS.

Procedure:

Step 1: Determine the proportion of the data to be budgeted to the focusing and sensing stages of the analysis.

Step 2: Group clusters of SNPs into regional LD blocks determined by the appropriate HapMap reference panel.

Step 3: Apply a low-resolution Likelihood Ratio Test (LRT) to SNPs within LD blocks to determine which regions show little evidence of harboring SNPs associated with risk of disease.

Step 4: Remove regions with the least evidence of association from the search space.

Step 5: Apply a finer-resolution SNP-based association test to the remaining data, testing only those SNPs in regions that survived the trimming in Step 4.

Step 6: Determine which loci show significant evidence for association with disease after correcting for the appropriate multiple testing burden in the trimmed search space.
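The procedure in Figure 2 can be sketched end to end on precomputed summary statistics. This is a minimal illustration under several assumptions and should not be read as the GRIDS implementation itself. It assumes per-SNP z-scores have already been computed separately on the focusing and sensing portions of the data (Step 1) and that LD blocks are supplied as a mapping (Step 2). In place of the LRT it uses the sum of squared z-scores per block as a stand-in regional score, which ignores LD between SNPs and is only comparable across blocks of similar size. The final correction is assumed to be Bonferroni. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def regional_scores(z_focus: Sequence[float],
                    blocks: Dict[str, List[int]]) -> Dict[str, float]:
    """Step 3 stand-in: sum of squared z-scores per LD block (not a true LRT)."""
    return {name: sum(z_focus[i] ** 2 for i in snps)
            for name, snps in blocks.items()}


def trim_blocks(scores: Dict[str, float], keep_fraction: float) -> List[str]:
    """Step 4: keep the top-scoring fraction of blocks (quantile-style cut)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def grids_sketch(z_focus: Sequence[float], z_sense: Sequence[float],
                 blocks: Dict[str, List[int]],
                 keep_fraction: float = 0.5, alpha: float = 0.05) -> List[int]:
    """Steps 3 to 6 on summary statistics; returns indices of significant SNPs."""
    surviving = trim_blocks(regional_scores(z_focus, blocks), keep_fraction)
    # Step 5: only SNPs in surviving blocks are tested on the sensing data.
    candidates = [i for name in surviving for i in blocks[name]]
    # Step 6: Bonferroni (an assumption) over the trimmed search space only.
    threshold = alpha / len(candidates)
    return [i for i in candidates if two_sided_p(z_sense[i]) < threshold]


# Toy data, invented for illustration: 8 SNPs in 4 LD blocks of 2 SNPs each.
toy_blocks = {"A": [0, 1], "B": [2, 3], "C": [4, 5], "D": [6, 7]}
toy_z_focus = [0.1, -0.3, 2.5, 1.0, 0.2, 0.4, -0.5, 0.1]
toy_z_sense = [0.3, 0.0, 2.6, 0.5, 0.0, 0.0, 1.0, 0.2]
hits = grids_sketch(toy_z_focus, toy_z_sense, toy_blocks)
print(hits)
```

In this toy example, blocks B and D survive trimming, so four SNPs are tested at a threshold of 0.05/4 = 0.0125. SNP 2 has a sensing-stage p-value of about 0.009, which passes that threshold but would fail the threshold of 0.05/8 = 0.00625 required if all eight SNPs were tested. The trade-off is that a true signal in a trimmed block can no longer be detected.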

In document New methods for studying complex diseases via genetic association studies (Page 38-42)