This section highlights the key results presented in this thesis. The results in solving the two mining problems for genome data and two mining problems for streaming and sensor network data are summarized below.
1.4.1
Mining Genome Data
Inferring Segmentation Structure of a Single Recombinant Strain
• ModelI propose the Minimum Segmentation Model to infer the origins of haplo- type segments in a recombinant strain (genotype sequence) given a set of known founders (haplotype sequences). Each segment on a recombinant strain is at- tributable to one of the founders. The minimum segmentation can be used for inferring the relationship among recombinant sequences to identify the genetic ba- sis of traits, which is important for disease association studies. The basic model is also extended to handle noise and support additional biologically-motivated con- straints. The biological constraints include the funnel breeding scheme and the requirement of comparable number of segments on both haplotypes. These con- straints guarantee the biological validity of the solutions as well as significantly reduce the search space. Furthermore, noise (point mutations, gene conversions, and genotyping errors) and missing values, as common to all biological datasets, are incorporated to improve the robustness of the model.
• AlgorithmI propose efficient dynamic programming algorithm to solve the Min- imum Segmentation problem in polynomial time based on block-wise decomposi- tion of the sequences. The algorithm has a time complexity of O(LN +P4) and a space complexity ofO(P N2), where Lis the number of SNPs, N is the number
of founders, and P is the number of blocks.
• Performance The proposed algorithms permits genome wide analysis on real mouse genome datasets (Collaborative Cross), and generates feasible biological solutions on CC (preCC and G2F1 strains). The proposed algorithm can also handle noise and missing values properly on these strains.
Inferring Mosaic Structure of a Recombinant Strain Panel
• Model I propose the Minimum Mosaic Model to capture the minimum number of recombination breakpoints required for generating a set of genome sequences (in haplotypes). This mosaic structure provides a good estimation of the rate and possible locations of the recombination events, which is useful for inferring haplotype block structures and genealogical history.
• AlgorithmI proposed an efficient graph-based algorithm for computing the min- imum mosaic structure for a given set of haplotype sequences. The strains in the SNP arrays are divided into blocks. For any two neighboring haplotype blocks, the local breakpoints are inferred according to the Four-Gamete Test (FGT). Pos- sible local breakpoint sets (as graph nodes) are connected to form a combinatorial search space for minimum solution.
• Performance The proposed algorithm permits genome-wide analysis. The ex- periments on CGD mouse genome datasets with 51 mouse strains over total 7.8M SNPs demonstrates the good performance of the algorithm (less than half an hour for each chromosome).
1.5
Mining Environmental and Ecological Data
Clustering Distributed Data Streams
• ModelI define the problem of approximate k-Median Clustering over data arriving at a distributed data stream system such as a sensor network.
• Algorithm I propose a suite of resource-aware algorithms for continuously com- puting (1 +)-approximate k-median clustering over distributed data streams un- der three different topology settings: topology-oblivious, height-aware, and path- aware. I incorporated the in-network aggregation techniques to avoid the raw
data transmission between sites, which is the main drain on the sensor’s battery. Efficient summary structures are designed to reduce data transmission as well as guarantee bounded-error solution. The summary structure is constructed by di- viding the in coming streams into fixed-size blocks. The algorithms reduce the maximum per node transmission topolylogN (opposed to N for transmitting the raw data).
• Performance The algorithms demonstrate the scalability of the algorithms with respect to the data volume, approximation factor, and the number of sites. All three algorithms can greatly reduce the total as well as per node transmission, especially with larger-scale data.
Order Statistics Computation on High-Speed Data Streams
• ModelI solve the problem of approximate quantile and biased-quantile for high- speed data streams.
• AlgorithmI propose a fast algorithm for computing approximate quantiles in high speed data streams with deterministic error bounds. The algorithm uses simple block-wise merge and sample operations. Overall, the proposed algorithm achieves a per-element update computational cost ofO(log(1/log(N))) for approximation factor and stream size N (N is not know beforehand). In addition, I proposed an efficient algorithm for computing approximate biased quantiles in large data streams. The algorithm is based on a novel piece-wise uniform sampling technique which computes decomposable biased quantile summaries on fixed size blocks of the incoming data stream. The algorithm is computationally efficient, does not assume prior knowledge of the stream sizes or the range of data values in the streams, and is applicable to distributed data stream system. In practice, the algorithm is able to efficiently maintain summaries over large data streams with
over tens of millions of observations.
• PerformanceExperiments demonstrate that the algorithms are able to efficiently maintain summaries over large data streams with over tens of millions of observa- tions.