In the previous section, we saw how the UoI framework helped us develop a novel NMF algorithm that is robust to noise and yields interpretable parts-based decompositions of the data. UoI-NMF_cluster selects the right set of bases and estimates the weights separately, so as to reconstruct the data optimally and avoid reconstructing the noise. These superior results from UoI-NMF_cluster suggest that a similar approach
In the first chapter, we introduced the CUR decomposition, a popular dimensionality reduction method used in many applications. Here, we adapt the UoI framework to the CUR decomposition to develop the two-stage UoI-CUR algorithm. The algorithm reduces the variance of column/row sampling via the intersection operation.
UoI-CUR: The UoI-CUR algorithm is as follows. We use a bootstrap resampling approach (for column selection, we subsample rows, and vice versa). We compute subsets of columns (and rows) C_i for the different bootstrap samples i = 1, ..., B_1 and for different ranks k, using leverage score sampling.
Intersection: We then intersect the supports (indices) of the subsets of columns (and rows) C_i over the bootstraps to obtain a smaller intersected subset Ĉ(k) for each rank k. This intersection operation reduces the variance in sampling.
Union: Next, we obtain a larger union set of columns by taking the union of the intersected subsets Ĉ(k) over the different ranks k.
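The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used for the experiments: the function names (`leverage_scores`, `sample_columns`, `uoi_cur_columns`), the parameter `n_boot` (playing the role of B_1), resampling rows with replacement, and drawing exactly c distinct columns per bootstrap are all choices made here for illustration.

```python
import numpy as np

def leverage_scores(A, k):
    """Rank-k column leverage scores, normalized to sum to 1.

    Computed as the squared column norms of the top-k right singular vectors.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, Vt.shape[0])
    scores = np.sum(Vt[:k, :] ** 2, axis=0)
    return scores / scores.sum()

def sample_columns(A, k, c, rng):
    """Draw c distinct column indices with probabilities given by leverage scores."""
    p = leverage_scores(A, k)
    idx = rng.choice(A.shape[1], size=c, replace=False, p=p)
    return set(idx.tolist())

def uoi_cur_columns(A, ranks, c, n_boot=10, seed=0):
    """Hypothetical UoI-style column selection: intersect over bootstraps, union over ranks."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    union = set()
    for k in ranks:
        intersected = None
        for _ in range(n_boot):
            # Assumption: bootstrap the rows (with replacement) when selecting columns.
            rows = rng.choice(m, size=m, replace=True)
            cols = sample_columns(A[rows, :], k, c, rng)
            # Intersection step: keep only columns selected in every bootstrap.
            intersected = cols if intersected is None else intersected & cols
        # Union step: merge the intersected subsets over the different ranks.
        union |= intersected
    return sorted(union)
```

A call such as `uoi_cur_columns(A, ranks=[2, 3], c=8, n_boot=3)` returns a sorted list of column indices. Because the sampling is random, the intersection can be small or even empty when `n_boot` is large relative to c; how c and the number of bootstraps are chosen in practice is not specified here.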
10.4.1 Experiments - Tagging gene expressions
Analysis of gene expression DNA microarray data has become popular for studying a variety of biological processes [105]. In the microarray data, we have m genes (from m individuals, possibly from different populations), and a series of n arrays probe genome-wide expression levels in n different samples, possibly under n different experimental conditions. Hence, the data from microarray experiments is represented as a matrix A ∈ R^{m×n}, where A_ij indicates whether the jth expression level exists for gene i.
Typically, the matrix has entries in {−1, 0, 1}, indicating whether the expression exists (±1) or not (0), with the sign indicating the order of the sequence. In chapter 5, we saw how CUR decomposition and coarsening techniques can be used to select a subset of gene expressions or single nucleotide polymorphisms (SNPs), called the tagging SNPs (tSNPs), which best represent the gene pools. In this section, we demonstrate how the UoI-CUR algorithm performs in this application.
We consider here the same two datasets that we considered in chapter 5. Table 10.2 lists the errors obtained from three different methods, namely UoI-CUR, basic CUR and Greedy selection [105], for different populations. The error reported is again given by
Table 10.2: Tagging SNP: UoI-CUR, basic CUR and Greedy selection

Data                  Size       c    UoI-CUR   Basic CUR   Greedy CUR
Yaledataset/SORCS3    1966×53    30   0.0096    0.0323      0.0062
Yaledataset/PAH       1979×32    20   0.0165    0.0308      0.0165
Yaledataset/HOXB      1953×96    36   0.0690    0.1369      0.0272
Yaledataset/17q25     1962×63    35   0.0507    0.0895      0.0197
HapMap/SORCS3         268×307    83   0.0023    0.0624      0.0023
HapMap/PAH            266×88     42   0.0087    0.0130      0.0053
HapMap/HOXB           269×571    57   0.0840    0.1696      0.0211
HapMap/17q25          265×370    80   0.0421    0.1819      0.0162

[Figure 10.7, two panels: error versus number of columns c for UoI-CUR, basic-CUR and Greedy-CUR; one panel with c = 15, 19, 26, 30, 33, 36, the other with c = 32, 47, 57, 61, 73, 85, 103, 114, 117, 136.]

Figure 10.7: UoI-CUR: Error ‖A − P_C A‖_F as a function of the number of columns c.
matrix, Â = CC†A, is the projection of A onto C and nnz(A) is the number of nonzero elements in A. Recall that the greedy algorithm is very expensive but performs very well in practice. We observe that the UoI-CUR algorithm performs better than basic CUR, and that its performance is comparable to that of the greedy algorithm in many cases.
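As a small illustration of the projection Â = CC†A, the sketch below computes the Frobenius-norm error ‖A − CC†A‖_F for a given set of column indices, which is the quantity named in the caption of Figure 10.7. The function name `projection_error` is an assumption made here, and the sketch does not reproduce the nnz-normalized error reported in Table 10.2.

```python
import numpy as np

def projection_error(A, cols):
    """Frobenius-norm error of projecting A onto the span of the selected columns."""
    C = A[:, cols]                        # selected columns of A
    A_hat = C @ np.linalg.pinv(C) @ A     # A_hat = C C^+ A
    return np.linalg.norm(A - A_hat, "fro")
```

Since adding columns to C can only enlarge the subspace being projected onto, this error is non-increasing as more columns are selected, and it vanishes once the selected columns span the column space of A.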
Figure 10.7 plots the results for two of the genetics datasets considered in Table 10.2, as a function of the number of columns c. UoI resulted in reconstruction errors that were consistently lower than those of the base method (basic-CUR), and that quickly approached those of the unscalable greedy algorithm (Greedy-CUR) as the number of columns c increases. Thus, in both cases, UoI improved the prediction parsimony relative to the base method.
Material Informatics - Mining material data
11.1 Introduction
In recent years, as a result of the Materials Genome Initiative¹, machine learning (ML) techniques have emerged, among other 'material informatics' methods, for exploring materials data. Material informatics techniques based on machine learning have been shown to be an inexpensive means of exploiting materials data, and can be used to examine a variety of thermodynamic properties. In the final chapter of this thesis, we apply well-known supervised regression techniques to predict properties of compounds that are otherwise hard and expensive to compute, using easily available physical, chemical and structural properties of the compounds, known as features in machine learning or descriptors in material science. The goal of this work is to help bypass the time-intensive calculations used in material science, e.g., ab-initio calculations, some of which take days of computation. In particular, we present a machine learning (regression) technique for predicting the formation enthalpies of new metal alloys using easily available material data.
¹ https://www.mgi.gov/