Selected Topics in Statistical Computing and Genomic Data Analysis.


ABSTRACT

YANG, MENG. Selected Topics in Statistical Computing and Genomic Data Analysis. (Under the direction of Eric C. Chi and Jung-Ying Tzeng).

Biclustering approaches such as Spectral Biclustering and Convex Biclustering can recover checkerboard structure accurately and efficiently. Outliers, however, are common in practical problems and can seriously bias the estimates. A natural question is whether the checkerboard structure can still be recovered accurately and efficiently in the presence of outliers. This work aims to answer that question by incorporating the minimum distance estimator into the biclustering problem, assuming the outliers follow a mean shift outlier model. Minimum distance estimation (L2E) has been shown to be more robust than maximum likelihood estimation (MLE) in parametric modeling problems. Moreover, L2E has been applied in various areas and has shown strong resistance to outliers. Motivated by this previous work, we hope the newly constructed method can not only reveal the hidden checkerboard structure behind a data matrix but also resist outliers well. In Chapter 2, we demonstrate the advantages and disadvantages of our approach via simulations on both artificial data and real microarray lung cancer data.


Selected Topics in Statistical Computing and Genomic Data Analysis

by Meng Yang

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina 2019

APPROVED BY:

Wenbin Lu Arvind K. Saibaba

Eric C. Chi


DEDICATION


BIOGRAPHY

The author grew up in Beijing, the capital of China. He spent four years at Nankai University, a famous university in Tianjin, China, where he obtained his bachelor’s degree in Financial Engineering. Out of personal interest, he also obtained a minor in Applied Mathematics by completing all of the main courses and several elective courses in the Mathematics department at Nankai University. After graduating in the summer of 2012, he came to the U.S. to pursue his Master's and Ph.D. degrees at North Carolina State University. He majored in Financial Mathematics in the Mathematics department at North Carolina State University, but one year later he realized that he had lost his interest in Finance and had become more interested in Statistics than in any other subject, because he believed that what matters most is a tool that lets the data talk. He therefore decided to change his major to Statistics. Having made this decision, he worked very hard to learn as much Statistics as possible while he was still in the Financial Mathematics program. After two years of hard work, he not only gained admission to the Master's program in Statistics at North Carolina State University but also successfully completed his courses in the Financial Mathematics program. One year later, after passing the Basic exam, he transferred from the Master's program to the Doctoral program in Statistics at North Carolina State University.


ACKNOWLEDGEMENTS

I would like to thank my two main advisors, Eric Chi and Jung-Ying Tzeng, for all of their guidance, knowledge, patience, and understanding over the last few years. I would also like to thank Arvind Saibaba and Wenbin Lu for serving on my committee.

Dr. Tzeng effectively served as a co-advisor. I very much appreciate her careful and detailed guidance throughout the whole project. The project was not as easy as we expected before we started, since many numerical issues arose when computing the variance and optimizing. I thought about giving up several times, but I am fortunate that I did not and that I had her as one of my co-advisors. Not only did she guide me in how to do research and solve potential issues, but she also taught me how to keep calm and continue working without stressing too much. Moreover, she always encouraged me even when I made careless mistakes or reported work of poor quality. She emphasized the importance of documentation, which can be a great help for a years-long project. When I applied for internships, she and her husband also helped me with my resume and provided much invaluable advice, and I finally seized the chance to be a GRA student at SAS, a great company that everyone in the Statistics department at NCSU longs to join.

I am very fortunate that Dr. Chi joined the NC State faculty and also served as a co-advisor on my committee. I really appreciate his emphasis on Statistical Computing, Software Development, and Mathematics. Since I lacked research experience before joining the Doctoral program, he focused on teaching me how to be a good independent researcher during the project. He told me what a real researcher is supposed to do and shared with me many of his experiences from when he was a student. He also taught me how to review papers and shared many of his valuable earlier work samples with me. Working with him has definitely made me a more effective researcher.


TABLE OF CONTENTS

LIST OF TABLES . . . x

LIST OF FIGURES. . . xi

Chapter 1 Introduction . . . 1

1.1 Standard Linear Regression Model . . . 3

1.2 Penalized Linear Regression Model . . . 6

1.3 Biclustering And Convex Biclustering . . . 8

1.4 Multi-Platform Genomic Data Analysis . . . 10

Chapter 2 Robust Biclustering: A method based on the Minimum Distance Estimator . . . 14

2.1 Introduction . . . 14

2.2 A Brief Introduction to the Parametric Minimum L2 Error Criterion . . . 18

2.3 A Brief Introduction to MM Algorithm . . . 20

2.4 L2 Error Biclustering Criterion and Its Properties . . . 21

2.5 Connection to the M-estimator . . . 26

2.6 Estimation of Biclusters with LEBRA . . . 27

2.7 Robustness of LEBRA . . . 30

2.8 Convergence of LEBRA . . . 31

2.9 Model Selection . . . 32

2.10 Simulation Studies . . . 34

2.11 Application to Genomics . . . 40

2.12 Discussion . . . 46

Chapter 3 Hypothesis Testing for Tensor Regression with Application to Gene-set Integrative Analysis of Multi-omics Data. . . 48

3.1 Abstract . . . 48

3.2 Introduction . . . 49

3.3 Methods . . . 52

3.3.1 Notation and Preliminaries . . . 52

3.3.2 Tensor Regression Model for Integrative Gene-Set Analysis . . . 54

3.3.3 Estimation . . . 55

3.3.4 Association Test for Single Genomic Variables . . . 58

3.4 Simulation Studies . . . 59

3.4.1 Performance of Detecting Causal Variables . . . 60

3.4.2 Performance of Detecting Causal Variables . . . 61

3.5 Real Data Analysis . . . 66

3.6 Discussion . . . 69

3.7 Future work . . . 72


4.1 Methylation Data Processing Pipeline . . . 74

4.2 Transcriptome Profiling Data Processing Pipeline . . . 80

4.3 CNV Data Processing Pipeline . . . 84

References. . . 87

Appendices . . . .101

Appendix A Robust Biclustering: A method based on the Minimum Distance Estimator . . . 102

A.1 Proofs . . . 102

A.2 Additional Results . . . 109

A.3 MEASURES OF CLUSTERING SIMILARITY . . . 115

Appendix B Hypothesis Testing for Tensor Regression with Application to Gene-set Integrative Analysis of Multi-omics Data . . . 120

B.1 Background on Tensor, Kronecker product and Box product . . . 120

B.2 Rank Selection: AIC vs BIC . . . 125

B.3 Proof of Prop 1 . . . 127

B.4 EM algorithm for binary data . . . 131

B.5 Additional Information for Real Data Analysis Results . . . 137

Appendix C TCGA Data Processing Pipeline . . . 140

C.1 Codes For Methylation Data Processing Pipeline . . . 140

C.1.1 Methylation_clean0.sh . . . 140

C.1.4 Methy_pathway_selection.sh . . . 145

C.1.5 Methy_sampleID_splitting.sh . . . 146

C.2 Codes For CNVs Data Processing Pipeline . . . 147

C.2.1 CNV_clean_step0.sh . . . 147

C.3 Codes For Transcripts Data Processing Pipeline . . . 151


LIST OF TABLES

Table 3.1 Performance of detecting causal genomic variables of tensor regression (TR), linear model (LM), and LASSO with effect size C = 0.25, sample size n = 500 and B with dimension 3×74 based on 50,000 replications. For TR and LM, a variable is selected as significant if its p-value is less than α = 0.05/(3×74) = 0.00023. . . 63

Table 3.2 This table shows the simulation results for "I" under 1000 replications, 74 genes and 500 sample size. Bonferroni correction is not used. . . 65

Table 3.3 This table shows the simulation results for "I", "Random" and "T" under


LIST OF FIGURES

Figure 1.1 Blue line is the regression line based on clean data. Red line is the regression line based on contaminated data . . . 5

Figure 1.2 The objective functions in $\mathbb{R}^2$ of Ridge Regression and Lasso and their support from two angles. The region inside the dark red cylinder is the support of ridge regression, while the dark green cuboid is the support of lasso. The purple dot is the OLS estimate. The contours of the OLS objective function are shown in rainbow colors. All approaches look for the minimum of the OLS objective function, but penalized regression approaches look for it within the support defined by $J$. . . 7

Figure 1.3 Different groups are labeled with different colors. We have n objects and p features. The data matrix is an n-by-p matrix. . . 8

Figure 1.4 Two categories under the "joint-modeling" based analysis: tensor based (left side) and transformation based (right side). There are other categories such as iBAG (model based, Wang, Baladandayuthapani, Morris, Broom, Manyam, and Do (2012)). . . 12

Figure 2.1 Complete MM procedure looking for the minimizer $x_\infty = \arg\min_x f(x)$. The thick solid curve is $f(x)$. The dashed curves are majorants based on each update $x_i$ . . . 21

Figure 2.2 Row Size: 100, 30, 30, 40; Column size: 50, 60, 50, 40; No outliers . . . 36

Figure 2.3 Row Size: 10, 100, 50, 40; Column size: 10, 100, 50, 40; No outliers . . . 37

Figure 2.4 Row Size: 100, 30, 30, 40; Column size: 60, 60, 60, 50. The number of outliers: 4000 . . . 38

Figure 2.5 Row Size: 10, 100, 50, 40; Column size: 10, 100, 50, 40. The number of outliers: 4000 . . . 39

Figure 2.6 Snapshots of the LEBRA solution path of the lung cancer dataset as the penalty parameter λ increases. All of the matrices have been forced to have the same column order and row order for comparison purposes. It shows the whole trend of the LEBRA solution path, from the estimate with insufficient penalty strength to the over-smoothed constant estimate. . . 41

Figure 2.7 Performance of COBRA, LEBRA and SpBC under the measures of Rand Index, Variation of Information, Adjusted Rand Index and Normalized Mutual Information . . . 43

Figure 2.8 Running time of the previous simulation experiment needed by COBRA, LEBRA and SpBC . . . 44

Figure 2.9 Performance of COBRA, LEBRA and SpBC for different magnitudes of outliers . . . 45


Figure 3.2 Bias of variance estimates for the tensor coefficient estimates, with bias = estimated variance − empirical variance. The 3 panels from left to right represent 3 different signal patterns in B: "I" shape (R = 1), "Random" shape (R = 2), "T" shape (R = 2). Orange boxplots are biases for $B_{pg} = 0$ and blue boxplots are biases for $B_{pg} \neq 0$. . . 62

Figure 4.1 Interpretation of the sample ID (https://docs.gdc.cancer.gov) . . . 75

Figure 4.2 DNA methylation overview. Methylation occurs at CpG dinucleotides, regulated by DNA methyltransferases (Dnmts). Dnmt1 is predominantly associated with methylation of hemimethylated CpG dinucleotides, while Dnmt3a and Dnmt3b contribute to de novo methylation. . . 76

Figure 4.3 The structure of data dnamethy_UCEC.Rdata . . . 76

Figure 4.4 First section of methylation data processing: data cleaning process. Green rectangles represent intermediate files generated throughout the first section. Light blue rectangles represent the bash script files implementing each sub step. Detailed bash scripts are provided in Appendix C. . . 78

Figure 4.5 Second section of methylation data processing: splitting and selection. Green rectangles represent intermediate files generated throughout the first section. Yellow rectangles represent the bash script files implementing each sub step. Detailed bash scripts are provided in Appendix C. . . 79

Figure 4.6 The “central dogma” of molecular biology describes two steps: (1) Transcription: a segment of DNA is transcribed into either coding RNA or non-coding RNA; (2) Translation: coding RNA is further translated into polypeptide and then protein. . . 80

Figure 4.7 The structure of transcripts_UCEC.Rdata . . . 81

Figure 4.8 Transcripts data set processing procedure. The blue rectangle represents the file that runs the whole process in the dashed-border rectangle. Green rectangles represent the files that are needed or produced during or after the whole data processing procedure. The detailed R scripts are given in Appendix C. . . 82

Figure 4.9 Changes in the order and number of genes A, B, C and D vary the human genome, which makes us different. The bottom right shows the copy number variations (CNVs) . . . 84

Figure 4.10 CNVs data processing and splitting procedure . . . 86

Figure A.1 Additional Measures for Figure 2.2. Row Size: 100, 30, 30, 40; Column size: 50, 60, 50, 40. The number of outliers = 0. The other five measures show results similar to the four measures we reported in the paper. . . 109

Figure A.2 Additional Measures for Figure 2.3. Row Size: 10, 100, 50, 40; Column size: 10, 100, 50, 40. The number of outliers = 0. The other five measures show results similar to the four measures we reported in the paper. . . 110

Figure A.3 Additional Measures for Figure 2.5. Row Size: 10, 100, 50, 40; Column size:


Figure A.4 Additional Measures for Figure 2.4. Row Size: 100, 30, 30, 40; Column size: 60, 60, 60, 50. The number of outliers: 4000. We can see that JI has a similar problem to VI, which has been discussed . . . 112

Figure A.5 Additional Measures for Figure 2.7. JI behaves similarly to VI, and the reason has been discussed. The other additional measures exhibit results similar to the four we reported. . . 113

Figure A.6 Additional Measures for Figure 2.9. The other five measures show results similar to the four measures we reported in the paper. . . 114

Figure B.1 Order-1 tensor: vector; order-2 tensor: matrix; order-3 tensor: cube. We do not have specific names for higher-dimensional arrays . . . 121

Figure B.2 Box product vs Kronecker product . . . 124

Figure B.3 Rank selection results for BIC and AIC of Shape "T". Index is the index of the 1000 replications . . . 126

Figure B.4 "Triangle" shape with true rank 3 . . . 127

Figure B.5 Rank selection results for BIC and AIC of Shape "Triangle". Index is the


Chapter 1

Introduction

People compute. Restaurant owners always seek to predict how much customers will consume to know how much food they need to prepare. Investors aim to create the best portfolios to maximize their profits while maintaining a relatively low risk. Furthermore, meteorologists recently adjusted the parameters of their own models to predict the possible path of Hurricane Florence. Nature computes. The solar system tends to seek balance by adjusting the position of each of its celestial bodies. Rays of light always find the routes that minimize their travel time. Moreover, green plants face the direction of sunlight.


there are more potentialities since the advent of increasingly advanced technologies, which have made originally intractable problems in statistical computing solvable. For example, sufficiently powerful computers can generate random numbers, so that the Monte Carlo method becomes a feasible and simple way to solve many problems, e.g., evaluating the integral of a certain function. MCMC has now become an important research area in Statistical Computing. The main goal of Statistical Computing is to optimize an objective function accurately and efficiently. Given a practical problem, the procedure of determining the objective function, constraints, and variables of interest is called modeling. It is important to think twice before constructing any model. If the model is too simple, the solution will lose its validity in the application. However, if the model is too complex, it might take years to find the solution, which makes the modeling pointless. The form of the objective function reflects not only the goal and the problem of interest of the researcher but also the definition of a "good" model. A good model can make the optimization procedure much easier and more scalable, which will be explained in the next paragraph.
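As a small illustration of the Monte Carlo idea mentioned above, the following sketch estimates an integral by averaging the integrand at uniformly drawn points. The function name `monte_carlo_integral`, the test integrand, and the sample size are assumptions made for illustration only; they are not part of the methods developed in this dissertation.

```python
import math
import random


def monte_carlo_integral(f, a, b, n_samples=100_000, seed=0):
    """Estimate the integral of f over [a, b] by averaging f at uniform draws."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(n_samples):
        total += f(rng.uniform(a, b))
    # (b - a) times the sample mean of f(U) is an unbiased estimate of the integral.
    return (b - a) * total / n_samples


# Example: the integral of sin(x) over [0, pi] is exactly 2.
estimate = monte_carlo_integral(math.sin, 0.0, math.pi)
```

The error of such an estimate shrinks at the rate of one over the square root of the number of samples, which is why cheap random number generation matters in practice.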


A comprehensive understanding of the problem is needed to design a proper algorithm. Once an algorithm has been chosen, it is also important to let the machine recognize when a solution has been reached. There are many ways to do so. Empirically, one can check the relative or absolute change in the objective function, or the distance between the current and previous estimates. More rigorously, the KKT conditions can be checked as well.
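The stopping rules just described can be sketched as follows. This is a minimal example under stated assumptions: the toy quadratic objective, the fixed step size, the tolerance, and the helper name `has_converged` are all hypothetical and chosen only to make the rules concrete. For an unconstrained smooth problem the KKT conditions reduce to a zero gradient, so the gradient norm serves as the more rigorous check.

```python
def has_converged(f_prev, f_curr, x_prev, x_curr, tol=1e-8):
    """Empirical stopping rule: small relative objective change or small step."""
    rel_change = abs(f_curr - f_prev) / max(abs(f_prev), 1.0)
    step = sum((a - b) ** 2 for a, b in zip(x_curr, x_prev)) ** 0.5
    return rel_change < tol or step < tol


def objective(x):
    # Toy smooth convex objective with minimizer (1, -2); purely illustrative.
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2


def gradient(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


x = [0.0, 0.0]
f_x = objective(x)
for iteration in range(10_000):
    g = gradient(x)
    x_new = [xi - 0.1 * gi for xi, gi in zip(x, g)]  # gradient step, fixed size 0.1
    f_new = objective(x_new)
    done = has_converged(f_x, f_new, x, x_new)
    x, f_x = x_new, f_new
    if done:
        break

# KKT check for an unconstrained problem: the gradient should be close to zero.
kkt_residual = max(abs(gi) for gi in gradient(x))
```

Note that a tight tolerance on the change in the objective does not imply an equally small gradient, which is one reason to inspect the KKT residual separately.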

My research area is related to Statistical Computing. More specifically, my research mainly comprises two topics. One is the application of low-rank tensor regression to TCGA data analysis, where we studied the parameter-wise hypothesis testing procedure by investigating the behavior of the test statistics under the tensor regression framework. The other topic concerns the robustness of biclustering algorithms. The objective function has been redesigned so that the resulting algorithm has stronger resistance to outliers than traditional approaches. In addition, a simple algorithm has been developed to optimize the newly formulated objective function. Both projects are closely related to optimization in linear regression, penalized linear regression, and their extensions. In the following sections, I briefly give an overview of linear regression, penalized regression, and their usage in my projects.

1.1 Standard Linear Regression Model

Consider the standard linear regression model for the $i$-th observation, $i = 1, \ldots, n$,

$$y_i = x_i^T \beta + \epsilon_i \tag{1.1}$$

where $n$ is the sample size and $y_i$, $i = 1, 2, \ldots, n$, is the response variable for the $i$-th sample. If we have $p$ predictors/variables, then $x_i \in \mathbb{R}^p$ is the realization of all $p$ predictors for the $i$-th sample, whose $j$-th element is $x_{ij}$. $\beta$ is the $p$-by-1 unknown parameter vector. The goal of the algorithm solving


into a compact form:

$$y = X\beta + \epsilon \tag{1.2}$$

where

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{np}
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$

$y$ is the $n$-by-1 response vector. It can be of any continuous type. For example, $y$ can be any trait in genetic data analysis, such as status, pattern, or growth. We call the $n$-by-$p$ matrix $X$ the design matrix. Each column of the design matrix represents a realization of a predictor, specifically a genetic variable in genetic data analysis. We are interested in the association analysis between a certain trait and a group of genes. In that case, each predictor, which is a column of the design matrix, can be interpreted as a measure of a gene. $\epsilon$ is the random error vector whose entries are i.i.d. Gaussian with mean zero and variance $\sigma^2$. $\beta$ is the $p$-by-1 parameter vector. The typical method for estimating $\beta$ is Ordinary Least Squares (OLS), which estimates $\beta$ by computing the optimal point of the following optimization problem:

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 \tag{1.3}$$

The intuition behind (1.3) is to let the model account for all the samples $\{y_i, x_i^T\}_{i=1}^n$ by minimizing the sum of the squared distances of the observations, $(y_i - x_i^T\beta)^2$. After setting the first-order derivative to zero, (1.3) has the well-known solution $\hat{\beta} = (X^TX)^{-1}X^Ty$. Although the solution from


Figure 1.1: Blue line is the regression line based on clean data. Red line is the regression line based on contaminated data

In Figure 1.1, the blue solid line is the standard linear regression fit based on the data with outliers, labeled with blue triangles, whereas the red solid line is the fit based on the data without outliers, labeled with red dots. It is clearly shown that the blue line is dragged towards the outliers. The other drawback of the OLS estimate is that it does not work when the number of parameters is larger than the sample size, that is, $n < p$. There are various ways to understand the issues caused by $n < p$. For one, $p > n$ makes the Gram matrix $X^TX$ rank deficient. As a result, the OLS estimate is not unique, and it is unstable because it involves inverting a matrix that is singular or close to singular. One way to improve the standard linear regression model is to add constraints. The corresponding optimization problem becomes:

$$\hat{\beta} = \arg\min_{J(\beta) \le t} \frac{1}{2}\|y - X\beta\|_2^2 \tag{1.4}$$

(1.4) searches for the solution within a certain region specified by $J$, whereas OLS seeks the solution in the


explained in the next sub-section.
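The two drawbacks of OLS discussed in this section can be reproduced numerically. The sketch below is illustrative only: the simulated data, the size of the mean shift, and the helper name `ols` are assumptions, not the data behind Figure 1.1. It fits (1.3) on clean and on contaminated responses and then checks the rank of the Gram matrix when $n < p$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): y = 1 + 2 x + noise.
n = 50
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])      # design matrix with an intercept column
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(n)


def ols(X, y):
    """Least squares estimate; lstsq avoids forming an explicit inverse."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat


beta_clean = ols(X, y)

# Contaminate the five responses with the largest x by a mean shift of +20.
y_contaminated = y.copy()
y_contaminated[-5:] += 20.0
beta_contaminated = ols(X, y_contaminated)

# With more predictors than samples, the Gram matrix X^T X is rank deficient:
# its rank is at most n, so it cannot be inverted.
X_wide = rng.standard_normal((10, 30))    # n = 10 < p = 30
gram_rank = np.linalg.matrix_rank(X_wide.T @ X_wide)
```

Comparing `beta_clean` with `beta_contaminated` shows how a handful of shifted responses drags the fitted slope away from the value used to simulate the data.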

1.2 Penalized Linear Regression Model

The equivalent unconstrained form of (1.4) can be written as

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda J(\beta) \tag{1.5}$$

where $\lambda \in \mathbb{R}_+$ is called the tuning or regularization parameter. $\lambda$ controls the strength of the penalty: the larger $\lambda$ is, the stronger the penalty. A large value of $\lambda$ corresponds to a small value of $t$ in (1.4). Thus, when $\lambda = 0$, the penalized regression model is equivalent to the standard linear regression model. When $\lambda = \infty$, the model effectively ignores the squared loss part of the objective function, $\frac{1}{2}\|y - X\beta\|_2^2$. The form of $J$ is user specified and depends on the purpose of the model. The two most common choices are $J = \|\cdot\|_2^2$ and $J = \|\cdot\|_1$, which are called ridge regression and LASSO, respectively. Ridge regression and lasso look for a "small" $\hat{\beta}$ in terms of $\|\cdot\|_2^2$ or $\|\cdot\|_1$. The difference between lasso and ridge regression is the shape of the support, which can be easily visualized in Figure 1.2.

The typical optimization procedure for solving ridge regression is the same as the procedure for solving a standard linear regression model, since the ridge penalty term can be absorbed into the squared loss. This can be seen from the following simple algebra:

$$\begin{aligned}
\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2
&= \min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \frac{1}{2}\|0 - \sqrt{2\lambda}\, I\beta\|_2^2 \\
&= \min_{\beta} \frac{1}{2}\left\| \begin{pmatrix} y \\ 0 \end{pmatrix} - \begin{pmatrix} X \\ \sqrt{2\lambda}\, I \end{pmatrix} \beta \right\|_2^2 \\
&= \min_{\beta} \frac{1}{2}\|y^* - X^*\beta\|_2^2
\end{aligned}$$

where $y^* = \begin{pmatrix} y \\ 0 \end{pmatrix}$ and $X^* = \begin{pmatrix} X \\ \sqrt{2\lambda}\, I \end{pmatrix}$. Ridge regression has thus been converted to a standard linear regression.

Figure 1.2: The objective functions in $\mathbb{R}^2$ of Ridge Regression and Lasso and their support from two angles. The region inside the dark red cylinder is the support of ridge regression, while the dark green cuboid is the support of lasso. The purple dot is the OLS estimate. The contours of the OLS objective function are shown in rainbow colors. All approaches look for the minimum of the OLS objective function, but penalized regression approaches look for it within the support defined by $J$.
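The reduction of ridge regression to an ordinary least squares problem can be checked numerically. The sketch below uses simulated data and an arbitrary penalty value, both assumptions for illustration. With the loss scaled by $\frac{1}{2}$ and the penalty written as $\lambda\|\beta\|_2^2$, the block stacked under $X$ is $\sqrt{2\lambda}\, I$, and the stationarity condition is $(X^TX + 2\lambda I)\beta = X^Ty$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 5, 0.7                      # illustrative sizes and penalty
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed form of min 0.5*||y - X b||^2 + lam*||b||^2:
# setting the gradient to zero gives (X^T X + 2*lam*I) b = X^T y.
beta_closed = np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(p), X.T @ y)

# Augmented formulation: stack sqrt(2*lam)*I under X and zeros under y,
# then solve an ordinary least squares problem.
X_aug = np.vstack([X, np.sqrt(2.0 * lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```

Both routes give the same estimate up to floating point error, so any OLS solver can be reused for ridge regression.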

As for lasso, there is no explicit solution in general (there is one only when the columns of $X$ are orthogonal to each other). However, many iterative approaches work quite well for solving it, such as the proximal gradient algorithm, ADMM, MM, and even gradient descent with some additional conditions during the iteration. These have been discussed in detail in many textbooks, so we omit them in this introduction.
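For concreteness, one of the iterative approaches named above, the proximal gradient algorithm (often called ISTA), can be sketched in a few lines. The function names, the fixed iteration count, and the simulated orthonormal design are assumptions for illustration. In the special case of orthonormal columns, the explicit lasso solution is the soft-thresholding of $X^Ty$ at level $\lambda$, which the sketch uses as a check.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Proximal operator of threshold * ||.||_1, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for min 0.5*||y - X b||^2 + lam*||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of the squared loss
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta


# Orthonormal design: the lasso solution is the soft-thresholding of X^T y.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((30, 4)))   # columns of Q are orthonormal
y = rng.standard_normal(30)
lam = 0.3
beta_iterative = lasso_ista(Q, y, lam)
beta_explicit = soft_threshold(Q.T @ y, lam)
```

The step size is the reciprocal of the largest eigenvalue of $X^TX$, a standard choice that guarantees the objective does not increase from one iteration to the next.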

The objective function of the standard linear regression and the penalized regression can be extended to more general fitting settings:

$$\min_{B} \frac{1}{2}\|Y - XB\|_F^2 + \lambda J(B) \tag{1.6}$$

Many problems can be converted into an optimization problem with (1.6) as the objective function, such as matrix approximation, low-rank decomposition, and image smoothing. One of the most important extensions is the convex formulation of the clustering and biclustering problems.

1.3 Biclustering And Convex Biclustering

Before discussing biclustering, it is necessary to cover some basics of clustering. Clustering is a fundamental data mining technique. The goal of clustering is to group a set of objects according to some similarity measure or rule, such as the Euclidean distance. Clustering has been applied in many areas, such as image recognition and unsupervised genetic data analysis. Mathematically, the standard clustering problem can be formulated as follows. Suppose we have $n$ objects, $x_1, x_2, \ldots, x_n$. We aim to cluster the $n$ objects according to some similarity measure based on $p$ features, which means each $x_i$ is a $p$-by-1 vector whose $j$-th element is the score on the $j$-th feature. With this formulation, we can form an $n$-by-$p$ data matrix, $X \in \mathbb{R}^{n \times p}$, whose rows are samples and whose columns are features. The interpretations of rows and columns can vary. For example, rows can also represent different experimental conditions. In terms of the data matrix, the goal of clustering is simply to group rows according to some similarity measure based on the given $p$ features, as shown in Figure 1.3.


Viewing the clustering problem as grouping the rows of a data matrix makes it much easier to generalize to the concept of biclustering.

Biclustering is the problem of seeking homogeneous groups simultaneously in both columns and rows. Unlike clustering, which looks for latent structure based on the whole feature space, biclustering allows groupings based on a subspace of the whole feature space. Simply put, biclustering clusters a data matrix along both rows and columns simultaneously, hence the name "Bi-clustering".

There has been an influx of applications of biclustering, such as voting data (Hartigan, 1972), collaborative filtering recommendation systems (Elnabarawy, Wunsch, and Abdelbar, 2016), and lung cancer data (Yun and Yi, 2013; Chen, Zou, Lu, and Chen, 2014; Chi, Allen, and Baraniuk, 2017), as reviewed in Xie, Ma, Fennell, Ma, and Zhao (2018). Meanwhile, plenty of biclustering algorithms have also been suggested, as summarized in Prelić, Bleuler, Zimmermann, Wille, Bühlmann, Gruissem, Hennig, Thiele, and Zitzler (2006); Eren, Deveci, Küçüktunç, and Çatalyürek (2012); Oghabian, Kilpinen, Hautaniemi, and Czeizler (2014); Pontes, Giráldez, and Aguilar-Ruiz (2015); Roy, Bhattacharyya, and Kalita (2015). In genomic data analysis, the most popular biclustering algorithm is the clustered dendrogram, in which hierarchical clustering (Hastie, Tibshirani, and Friedman, 2009) is performed on the patients (columns) and on the genes (rows), respectively. As pointed out by Chi et al. (2017), the clustered dendrogram, as a biclustering algorithm, has two properties that make it less appropriate for generating reproducible results. First, a greedy search algorithm is used to construct the dendrogram, so the clustered dendrogram provides only a locally optimal result in terms of a certain error criterion. The algorithm therefore depends heavily on the initial setting. Second, the algorithm is not robust, in the sense that small perturbations in the data can lead to completely different biclustering results, not to mention outliers.


unconstrained objective error criterion:

$$\min_{U} \frac{1}{2}\|X - U\|_F^2 + \lambda J(U) \tag{1.7}$$

where $U$ is an $n$-by-$p$ matrix to be estimated and $X$ is an $n$-by-$p$ data matrix. (1.7) is a special case of (1.6). It can be viewed as a penalized regression with an identity design matrix and multiple response variables. The penalty $J$ steers the solution toward a checkerboard pattern.

(1.7) is convex as a function of $U$, and strictly so because of the squared Frobenius term. Consequently, it has a unique solution, so the estimate from (1.7) does not depend on the initial setting. Moreover, Chi et al. (2017) proved that the solution to (1.7) is jointly continuous in $(X, \lambda)$. This continuity makes the solution to (1.7) more stable against small perturbations. However, (1.7) still lacks resistance to outliers. As Figure 1.1 illustrates for regression, outliers can still bias the solution to (1.7) because the Euclidean distance is used as the measure of similarity. Chapter 2 of this dissertation proposes a new formulation of the biclustering objective error criterion that sacrifices convexity in exchange for greater robustness under the checkerboard pattern assumption.

Besides biclustering, association studies in genetic data analysis are also an important direct application of standard linear regression and penalized regression.

1.4 Multi-Platform Genomic Data Analysis


been proposed to detect the hidden genetic variation behind complex traits. For example, DNA sequence variation can be detected by linkage analysis and through association studies. Moreover, association studies between phenotypes and variation in other high-throughput platform datasets, such as RNA levels, methylation pattern sequencing, and protein variation, have become increasingly popular and are now treated as routine. At first, each data platform was analyzed separately to look for associations between phenotypes and genetic variation; with methods for multi-platform data sets, we have since assembled the different pieces of the puzzle from each platform into a bigger picture. Still, many of the complicated biological processes behind certain traits remain unclear, which could be due to the current limitations of the methodologies for multi-platform data sets.


genomic variables, especially since the sample size is usually moderate, due either to limited research resources or to missing data. Various strategies have been proposed to address the issue of high-dimensional genomic variables, e.g., Bayesian approaches (e.g., Wang et al. (2012)), penalized regressions (e.g., Liu, Huang, and Ma (2013); Shi, Liu, Huang, Zhou, Shia, and Ma (2014); Ma and Huang (2009); Ma et al. (2011); Jiang, Shi, Zhao, Krauthammer, Rothberg, and Ma (2016)), and dimension-reduction based methods, e.g., principal component analysis (PCA) (Meng, Zeleznik, Thallinger, Kuster, Gholami, and Culhane (2016)). Joint-modeling based methods can be further classified into (1) "tensor-based" and (2) "transformation-based" approaches, depending on how the data platforms are aligned.

Figure 1.4: Two categories under the "joint-modeling" based analysis: tensor based (left side) and transformation based (right side). There are other categories such as iBAG (model based, Wang et al. (2012)).


analysis and depends on the exact procedure in each step. "Meta-based" methods risk losing information when they assume independence among different platforms, genes, or gene sets. Even in "joint-modeling"-based analyses that are based on a transformation, an inappropriate transformation will project the data into a lower-dimensional space with incomplete information.


Chapter 2

Robust Biclustering: A method based on the Minimum Distance Estimator

2.1 Introduction

Biclustering, also known as co-clustering or two-mode clustering, is one of the most commonly used statistical tools in the analysis of high-dimensional datasets. Biclustering identifies and groups homogeneous objects, but it allows the existence of sub-clusters within each bicluster. Biclustering has been used in a wide array of areas. For example, in text mining, biclustering can be used to detect subgroups of documents with similar properties with respect to a subgroup of words (Dhillon, 2001). In gene expression data, biclustering can be used to identify a subgroup of genes that behave similarly across a subgroup of conditions. A complete review of biclustering with application to gene expression data can be found in Pontes et al. (2015). In this chapter, we assume the data can be partitioned into a checkerboard-like pattern.


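this suggests an optimization problem:

$$
\min_{U \in \mathbb{R}^{n \times p}} \frac{1}{2}\|X - U\|_F^2
\quad \text{subject to} \quad
\sum_{i<j} 1\{U(:,j) \neq U(:,i)\} \le t, \qquad
\sum_{i<j} 1\{U(j,:) \neq U(i,:)\} \le t
\tag{2.1}
$$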

where $X \in \mathbb{R}^{n \times p}$, U is the hidden checkerboard-like structure (a block matrix each of whose blocks is a constant matrix), and t is the largest number of row/column clusters allowed in the model. For example, in the 2-by-2 block case, the hidden checkerboard-like structure U looks like:

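$$
U = \begin{pmatrix}
\mu_{11}\mathbf{1}_1\mathbf{1}_3^T & \mu_{12}\mathbf{1}_1\mathbf{1}_4^T \\
\mu_{21}\mathbf{1}_2\mathbf{1}_3^T & \mu_{22}\mathbf{1}_2\mathbf{1}_4^T
\end{pmatrix}
$$

where the size of each all-ones vector $\mathbf{1}_i$, $i = 1, \ldots, 4$, depends on the size of the bicluster and the scalar $\mu_{ij}$ is the centroid of the $(i, j)$-th bicluster.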

However, (2.1) is very sensitive not only to outliers but even to a small perturbation in the data. One reason is that the hard thresholding constraints are too strict, since they do not allow any “grey” area. Soft thresholding constraints have been proposed as an alternative in order to improve the stability of (2.1). As indicated by the name, soft thresholding constraints relax the criterion by allowing a small difference, ε. This suggests the following modified version of (2.1):

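$$
\min_{U \in \mathbb{R}^{n \times p}} \frac{1}{2}\|X - U\|_F^2
\quad \text{subject to} \quad
\sum_{i<j} 1\{\|U(:,j) - U(:,i)\|_2 \le \epsilon\} \le t, \qquad
\sum_{i<j} 1\{\|U(j,:) - U(i,:)\|_2 \le \epsilon\} \le t
\tag{2.2}
$$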


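Then the equivalent unconstrained, weighted version of (2.2) can be written as:

$$
\min_{U \in \mathbb{R}^{n \times p}} \frac{1}{2}\|X - U\|_F^2
+ \lambda \left[ \sum_{i<j} \omega_{ij}\|U(:,j) - U(:,i)\|_2
+ \sum_{i<j} \tilde{\omega}_{ij}\|U(j,:) - U(i,:)\|_2 \right]
$$

where $\omega_{ij}$ and $\tilde{\omega}_{ij}$ are the weights that adjust the penalty strength of each pairwise difference.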

The original idea of biclustering comes from the term “Direct Clustering,” introduced by Hartigan (1972) with an application to Republican vote data. The term “Biclustering” was later used by Mirkin, but it became more popular after Cheng and Church (2000) applied a biclustering approach to a gene expression data set, in one of the most important papers in the gene expression biclustering field. Since then, biclustering has been studied extensively, and there is a large body of work, such as the Plaid model (Lazzeroni and Owen (2002)), Spectral Biclustering (Kluger, Basri, Chang, and Gerstein (2003)), FLOC (Yang, Wang, Wang, and Yu (2003)), the Coupled Two-Way Clustering method (CTWC) (Getz, Levine, and Domany (2000)), Non-smooth Non-Negative Matrix Factorization (Carmona-Saez, Pascual-Marqui, Tirado, Carazo, and Pascual-Montano (2006)), SSVD (Lee, Shen, Huang, and Marron (2010)), Sparse Biclustering (Tan and Witten (2014)) and COBRA (Chi et al. (2017)). Compared to clustering, biclustering has many advantages. One of the most important is that it is much more flexible, since it allows the existence of subgroups. For a more detailed list of biclustering work, please see the review papers by Madeira and Oliveira (2004); Busygin, Prokopyev, and Pardalos (2008); Pontes et al. (2015).

While extensive effort has been put into accelerating and stabilizing biclustering algorithms, noticeably less attention has been given to extending these methods to handle extreme perturbations, which are called “outliers.” A simple motivation for studying robustness comes from a natural question: given a data matrix with randomly located outliers, are we still able to recover the checkerboard-like pattern accurately and efficiently?


identifies outliers by applying some detection technique before any biclustering approach is applied to the dataset. The second step is to run a biclustering algorithm on the result of the first step. This approach risks losing information when the so-called “outliers” are thrown away. A more sophisticated way is to use Robust Principal Component Analysis (Partridge and Jabri (2000); Wright, Ganesh, Rao, Peng, and Ma (2009); Candès, Li, Ma, and Wright (2011)), assuming that the outliers are a minority relative to the sample size and treating any non-zero entry of the resulting sparse matrix as an outlier. The drawback of this more sophisticated approach is that a tuning parameter has to be chosen. Additionally, under all of these approaches the data actually analyzed change with the pre-selected outlier detector, so the data to which a biclustering algorithm is applied become very subjective. The major advantage of the automated approach is that the data set we work with does not vary, and it carries less risk of losing information than discarding the parts of the data labeled “outlier.” It is an automated process that combines handling outliers with running the biclustering algorithm. The approach we present in this chapter is an automated approach.

Many of the error criteria (the objective functions of biclustering minimization problems) can be formulated as a Frobenius norm that measures the disparity between an estimate and the dataset, along with penalty terms that impose sparsity on the solution (there are many other types of biclustering, e.g., graph-theory-based methods). Biclustering approaches usually do not perform well in the presence of outliers, since most of them are based on least squares, which is known to be sensitive to atypical observations. Least squares as a measure of distance tries to fit all of the data, normal data and outliers alike. A natural way to make current biclustering approaches robust is to find an approach that fits only the normal data while ignoring all or part of the outliers. In this chapter, we construct such an automated approach via the minimum distance approach. We will use the following model to construct the outliers.


error term:

X = U+C+E (2.3)

where U is the ground truth having a checkerboard-like structure, C is a sparse matrix that contains randomly located outliers, and E is a noise matrix that follows a multivariate distribution. When $E \sim N(0, \sigma^2 I)$, the data matrix X follows a normal distribution with mean U + C and variance $\sigma^2 I$. In particular, when C = 0, there are a large number of classical approaches that are capable of recovering an estimate with the checkerboard-like structure accurately and efficiently. Are we still able to recover the checkerboard structure U as well as the classical approaches do when we have not only E but also C? We will try to answer this question.

The mean shift outlier model has been widely used as the starting point of many outlier-related analyses. For example, it was used earlier for the outlier detection problem in the linear model (Wei and Fung (1999); McCann and Welsch (2007); She and Owen (2011); Candès et al. (2011)). Cases with non-zero C are not uncommon in the real world. For example, tissue samples may be mislabeled or contaminated through careless handling. It is also possible that a tissue sample dataset exhibits different statistical properties under different conditions that are not of interest. As a result, these outliers or non-constant statistical properties may bias the biclustering results dramatically.
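As a minimal sketch of the mean shift outlier model (2.3), the following Python snippet draws X = U + C + E for a small checkerboard. The function name `simulate_checkerboard`, the constant outlier shift, and all numeric settings are illustrative assumptions rather than the settings used in this chapter.

```python
import random

def simulate_checkerboard(row_sizes, col_sizes, centroids, sigma,
                          n_outliers, shift, seed=0):
    """Draw X = U + C + E: checkerboard mean U, sparse outliers C, Gaussian noise E."""
    rng = random.Random(seed)
    # Expand the group sizes into one group label per row and per column.
    row_lab = [r for r, size in enumerate(row_sizes) for _ in range(size)]
    col_lab = [c for c, size in enumerate(col_sizes) for _ in range(size)]
    n, p = len(row_lab), len(col_lab)
    # U is constant within each (row group, column group) block.
    U = [[centroids[row_lab[i]][col_lab[j]] for j in range(p)] for i in range(n)]
    # C is zero except at randomly chosen locations (assumed constant shift).
    C = [[0.0] * p for _ in range(n)]
    for idx in rng.sample(range(n * p), n_outliers):
        C[idx // p][idx % p] = shift
    # E has independent N(0, sigma^2) entries.
    X = [[U[i][j] + C[i][j] + rng.gauss(0.0, sigma) for j in range(p)]
         for i in range(n)]
    return X, U, C

X, U, C = simulate_checkerboard([3, 2], [2, 4], [[0.0, 5.0], [10.0, -5.0]],
                                sigma=1.0, n_outliers=4, shift=50.0)
```

Setting `n_outliers=0` gives the uncontaminated case C = 0 discussed above.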

2.2 A Brief Introduction to the Parametric Minimum $L_2$ Error Criterion


distance error criterion, is used to estimate the bin width of the histogram in the nonparametric least squares cross-validation problem (Rudemo (1982); Bowman (1984)). Motivated by its robustness and its use in estimating the bin width, Scott (2001) proposed a robust parametric unbiased estimation approach based on minimizing the Integrated Squared Error (ISE), which is also a special case of the minimum distance criterion. The derivation of L2E in parametric modeling is given below:

Consider minimizing the following ISE:

$$
\int \left[ f(x|\theta) - f(x|\theta_0) \right]^2 dx
$$

After some algebra, we have:

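$$
\begin{aligned}
\hat{\theta}_{L_2E}
&= \arg\min_{\theta} \int \left[ f(x|\theta) - f(x|\theta_0) \right]^2 dx \\
&= \arg\min_{\theta} \int \left[ f^2(x|\theta) - 2 f(x|\theta) f(x|\theta_0) + f^2(x|\theta_0) \right] dx \\
&= \arg\min_{\theta} \int \left[ f^2(x|\theta) - 2 f(x|\theta) f(x|\theta_0) \right] dx \\
&= \arg\min_{\theta} \left[ \int f^2(x|\theta)\,dx - 2 \int f(x|\theta) f(x|\theta_0)\,dx \right] \\
&= \arg\min_{\theta} \left[ \int f^2(x|\theta)\,dx - 2 E_{\theta_0} f(x|\theta) \right] \\
&\approx \arg\min_{\theta} \left[ \int f^2(x|\theta)\,dx - \frac{2}{n} \sum_{i=1}^{n} f(x_i|\theta) \right]
\end{aligned}
$$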

where f is the assumed statistical model for the data and $\theta_0$ is the true value of the unknown parameter


along with the MLE for comparison purpose.

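$$
\hat{\mu}_{MLE} = \arg\max_{\mu} \sum_{i=1}^{n} \log \phi(x_i|\mu, 1)
\tag{2.4}
$$

$$
\hat{\mu}_{L_2E} = \arg\min_{\mu} \left[ \frac{1}{2\sqrt{\pi}} - \frac{2}{n} \sum_{i=1}^{n} \phi(x_i|\mu, 1) \right]
\tag{2.5}
$$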

Note the difference between the MLE and the L2E: the MLE maximizes the product of the densities (equivalently, the sum of their logarithms), while the L2E maximizes the sum of the densities. Scott (2001) also demonstrated the robustness of the L2E by showing that its influence function is bounded, whereas the influence function of the MLE is unbounded.
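The contrast between (2.4) and (2.5) can be illustrated numerically. The sketch below, under assumed settings (90 standard normal draws contaminated by 10 draws centered at 10), computes the MLE as the sample mean and the L2E by a brute-force grid search over criterion (2.5). The helper names `normal_pdf`, `l2e_criterion`, and `l2e_mean`, as well as the grid search itself, are illustrative choices and not the procedure of Scott (2001).

```python
import math
import random

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def l2e_criterion(mu, xs):
    """Criterion (2.5): 1 / (2 sqrt(pi)) - (2 / n) * sum_i phi(x_i | mu, 1)."""
    return (1.0 / (2.0 * math.sqrt(math.pi))
            - 2.0 / len(xs) * sum(normal_pdf(x, mu) for x in xs))

def l2e_mean(xs, step=0.01):
    """Minimize the L2E criterion over a grid spanning the data range."""
    lo, hi = min(xs), max(xs)
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return min(grid, key=lambda mu: l2e_criterion(mu, xs))

rng = random.Random(1)
xs = ([rng.gauss(0.0, 1.0) for _ in range(90)]
      + [rng.gauss(10.0, 1.0) for _ in range(10)])  # 10% contamination
mu_mle = sum(xs) / len(xs)  # the MLE of a normal mean is the sample mean
mu_l2e = l2e_mean(xs)
```

In this setup the sample mean is pulled toward the contaminating cluster, while the L2E stays near the center of the majority of the data, in line with the bounded influence function noted above.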

2.3 A Brief Introduction to MM Algorithm

The MM algorithm (Hunter and Lange (2004)) generalizes the idea of the EM algorithm. The basic idea of the MM algorithm is to convert a hard problem (possibly non-convex) into a sequence of simpler subproblems; we will see how this works later. The term “MM” can be interpreted as Majorization-Minimization or Minorization-Maximization, depending on whether we are looking for a minimum or a maximum. A hallmark of the MM algorithm is its numerical stability: in the minimization case, every iteration is guaranteed not to increase the objective function. Suppose we want to find the minimum of a function $f(\theta)$. A function g is called a majorant of f at $\tilde{\theta}$ if the following conditions hold:

$$
f(\theta) \le g(\theta|\tilde{\theta}) \quad \text{for all } \theta
\tag{2.6}
$$

$$
f(\tilde{\theta}) = g(\tilde{\theta}|\tilde{\theta})
\tag{2.7}
$$


majorant at each iteration; (2) minimizing the majorant conditional on the current estimate. A complete MM procedure in $\mathbb{R}^2$ is illustrated in the following figure.

Figure 2.1: Complete MM procedure for finding the minimizer $x_\infty = \arg\min_x f(x)$. The thick solid curve is $f(x)$. The dashed curves are the majorants at each update $x_i$.
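The two MM steps can be sketched on a one-dimensional, non-convex example. The snippet below assumes the illustrative objective $f(\theta) = -\sum_i e^{-(x_i - \theta)^2/2}$, whose second derivative is bounded above by the number of observations, so a quadratic with curvature $L = n$ satisfies (2.6) and (2.7). The names `f`, `f_prime`, and `mm_minimize` and the toy data are assumptions made for illustration.

```python
import math

def f(theta, xs):
    """Non-convex objective: negative sum of Gaussian kernels."""
    return -sum(math.exp(-0.5 * (x - theta) ** 2) for x in xs)

def f_prime(theta, xs):
    """Derivative of f with respect to theta."""
    return -sum((x - theta) * math.exp(-0.5 * (x - theta) ** 2) for x in xs)

def mm_minimize(xs, theta0, n_iter=200, tol=1e-10):
    """MM iterations with the quadratic majorant
    g(theta | t) = f(t) + f'(t) (theta - t) + (L / 2) (theta - t)^2."""
    L = float(len(xs))  # each kernel term has second derivative at most 1
    theta, path = theta0, [f(theta0, xs)]
    for _ in range(n_iter):
        new_theta = theta - f_prime(theta, xs) / L  # minimizer of the majorant
        path.append(f(new_theta, xs))
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta, path

xs = [-0.3, 0.1, 0.4, -0.2, 0.0, 8.0]  # five inliers and one far-away point
theta_hat, path = mm_minimize(xs, theta0=1.0)
```

The recorded objective values in `path` never increase, which is the descent property described above; the far-away point at 8 has almost no effect on the limit.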

2.4 $L_2$ Error Biclustering Criterion and Its Properties

Motivated by the robustness of L2E demonstrated in parametric modeling, we hope to carry this robustness over to the biclustering problem. Assume that the ground truth U has a checkerboard structure. This assumption is natural because, in the biclustering setting, our goal is to cluster the rows and the columns of a data matrix simultaneously. If we reorder the rows and the columns of the data matrix according to the bicluster assignment, a checkerboard pattern appears, i.e., a block matrix in which each block is a constant submatrix. Denote the data matrix as $X \in \mathbb{R}^{n \times p}$, consisting of $n$


$x_{ij} = \mu_0 + \mu_{rc} + \epsilon_{ij}$, where $\mu_{rc}$ is the centroid of the r-th row group, $R_r$, and the c-th column group, $C_c$, $\mu_0$ is the overall baseline mean shared by every entry, and $\epsilon_{ij} \sim N(0, \sigma^2)$. Then each entry of the data matrix follows a normal distribution with mean $\mu_0 + \mu_{rc}$ and variance $\sigma^2$. Suppose we have $K_r$ row groups and $K_c$ column groups in total; then $p = \sum_{i=1}^{K_c} |C_i|$ and $n = \sum_{i=1}^{K_r} |R_i|$. Denote the latent checkerboard structure matrix by U; then $U = \{U_{rc}\}_{(r,c) \in R \times C}$, where

$$
U_{rc} = \mu_{rc} \mathbf{1}_{|R_r|} \mathbf{1}_{|C_c|}^T \in \mathbb{R}^{|R_r| \times |C_c|}
$$

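Motivated by (2.5), we propose to identify the latent checkerboard structure by minimizing the following penalized ISE. Denote $F_\lambda : \mathbb{R}^{n \times p} \times \mathbb{R}_{++} \to \mathbb{R}$; then $F_\lambda$ is the combination of a distance measure $\psi : \mathbb{R}^{n \times p} \times \mathbb{R}_{++} \to \mathbb{R}$ and a regularization term $J : \mathbb{R}^{n \times p} \to \mathbb{R}$:

\begin{align}
F_\lambda(U, \tau) &= \psi(X - U, \tau) + \lambda J(U) \tag{2.8} \\
&= \underbrace{\frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_{ij})^2}}_{\text{ISE}} + \underbrace{\lambda J(U)}_{\text{penalty}} \tag{2.9}
\end{align}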

where $J(U) = \sum_{i<j} \omega_{ij} \|U(:, i) - U(:, j)\|_2 + \sum_{i<j} \tilde{\omega}_{ij} \|U(i, :) - U(j, :)\|_2$, $U(:, i)$ ($U(i, :)$) denotes the i-th column (row) of U, and $\omega_{ij} = \iota^k_{\{i,j\}} \exp(-\phi \|X(:, i) - X(:, j)\|_2^2)$. Each weight is the product of two terms: an indicator and a Gaussian kernel measuring similarity. The weights here follow the same idea as the weights in the COBRA setting. The first term $\iota^k_{\{i,j\}}$ is 1 if j is among i’s k-nearest neighbors or vice versa, and zero otherwise, which imposes sparsity on the weights (Chi et al. (2017)).
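As a rough sketch of how such weights might be computed for the columns, the snippet below combines a k-nearest-neighbor indicator with a Gaussian kernel. The function name `knn_gaussian_weights`, the toy columns, and the values of k and φ are assumptions for illustration; this is not the implementation used in the cvxbiclustr package.

```python
import math

def knn_gaussian_weights(cols, k, phi):
    """Sparse weights w_ij = 1{i, j are k-nearest neighbours} * exp(-phi * ||c_i - c_j||^2).

    cols is a list of column vectors; the result maps pairs (i, j), i < j, to weights.
    Pairs that are not neighbours get weight zero and are simply omitted.
    """
    m = len(cols)

    def sqdist(a, b):
        return sum((s - t) ** 2 for s, t in zip(a, b))

    d2 = [[sqdist(cols[i], cols[j]) for j in range(m)] for i in range(m)]
    # k nearest neighbours of each column, excluding the column itself.
    knn = [set(sorted((j for j in range(m) if j != i),
                      key=lambda j: d2[i][j])[:k]) for i in range(m)]
    weights = {}
    for i in range(m):
        for j in range(i + 1, m):
            if j in knn[i] or i in knn[j]:  # "or vice versa" in the definition
                weights[(i, j)] = math.exp(-phi * d2[i][j])
    return weights

cols = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
w = knn_gaussian_weights(cols, k=1, phi=0.5)
```

With k = 1 only the two pairs of nearby columns receive a non-zero weight. Row weights can be obtained the same way by passing the rows of X instead of its columns.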


The penalty parameter λ tunes the tradeoff between the ISE and the penalty term. The weights $\omega_{ij} = \omega_{ji}$ and $\tilde{\omega}_{ij} = \tilde{\omega}_{ji}$ are non-negative and will be explained in detail later.

The regularization term J(U), or similar ideas, has been widely used in papers that aim to impose sparsity. Suppose J(U) regularizes only the rows (columns). If all of the weights are the same, the LEBRA error criterion becomes the sum of a distance measure and a fused penalty on the rows (columns), which is very similar to the idea of convex clustering described in Pelckmans, De Brabanter, Suykens, and De Moor (2005). If we set $\psi = \|\cdot\|$ and use non-uniform weights, say, a Gaussian kernel, the LEBRA error criterion is similar to the relaxed version of hierarchical clustering in Hocking, Joulin, Bach, and Vert (2011) and the extensions described in Lindsten, Ohlsson, and Ljung (2011). Moreover, if J(U) regularizes both rows and columns, the LEBRA error criterion becomes the same as the error criterion in COBRA (Chi et al. (2017)).

The regularization parameter controls how strongly the checkerboard structure is imposed on U. If λ = 0, no checkerboard pattern is imposed, and the best estimate of U is trivially X, the data matrix itself. If λ = +∞, the checkerboard structure is imposed on U infinitely strongly. In other words, no distinct clusters are allowed among either the rows or the columns, and the result is the simplest checkerboard structure, a constant matrix, provided that all the weights of the pairwise differences among both rows and columns are non-zero, i.e., $\iota^k_{\{i,j\}} = 1$ for all $(i, j)$. The approach for choosing a balanced λ will be discussed shortly.

A sparse Gaussian kernel is used to construct both the weights $\omega_{ij}$ and the distance metric. For the weights, this choice is made mainly for computational reasons; in fact, any kernel $\kappa(x_i, x_j)$ will do. The idea of using a kernel is to make the cluster path sensitive to the local density of the data (Hocking et al. (2011); Lindsten et al. (2011); Chi et al. (2017)). For the distance measure in the ISE, the use of the Gaussian kernel comes from an implicit assumption that the data matrix follows a Gaussian process. We will see this implicit assumption in the following discussion.


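useful for developing an intuition about the robustness of LEBRA. Recall the LEBRA error criterion, given fixed λ,

$$
F_\lambda(U, \tau) = \underbrace{\frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_{ij})^2}}_{\text{ISE}} + \underbrace{\lambda J(U)}_{\text{penalty}}
\tag{2.10}
$$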

Since J(U) imposes a checkerboard structure on the estimate $\hat{U}$, suppose without loss of generality that the total number of biclusters, K, is already known. Then we can restrict our attention to the matrices that have a K-block structure, denoted as $\Gamma_K \subsetneq \mathbb{R}^{n \times p}$. In other words, we no longer need the term J(U), and we have the following equivalent problem.

$$
\arg\min_{\tau \in \mathbb{R},\, U \in \mathbb{R}^{n \times p}} F_\lambda(U, \tau) = \arg\min_{\tau \in \mathbb{R},\, U \in \Gamma_K} \psi(X - U, \tau)
\tag{2.11}
$$

Notice that the equivalent problem on the right side has no penalty term. For convenience, we let the indices of all of the biclusters in $\hat{U}$ follow column-major order. Therefore, given a total of $K = K_r K_c$ biclusters, the location of $x_{rc}$ can be expressed in terms of $K$, $K_r$, and $K_c$. The corresponding row index is $(K \bmod K_r)$ and the corresponding column index is $(K \bmod K_c)$.


in the $(K \bmod K_c)$-th column group and the $(K \bmod K_r)$-th row group.

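\begin{align}
\psi(X - U, \tau) &= \frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{k=1}^{K} \sum_{(i,j) \in k} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_k)^2} \tag{2.12} \\
&= \sum_{k=1}^{K} \left[ \frac{\tau}{2K\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{(i,j) \in k} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_k)^2} \right] \tag{2.13} \\
&= \sum_{k=1}^{K} \left[ \frac{\tau}{2K\sqrt{\pi}} - \frac{\tau}{\sum_{i=1}^{K} |i|} \sqrt{\frac{2}{\pi}} \sum_{(i,j) \in k} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_k)^2} \right] \tag{2.14} \\
&= \sum_{k=1}^{K} \left[ \frac{\tau}{2K\sqrt{\pi}} - \frac{\tau}{Kq} \sqrt{\frac{2}{\pi}} \sum_{(i,j) \in k} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_k)^2} \right] \tag{2.15} \\
&= \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{q} \sqrt{\frac{2}{\pi}} \sum_{(i,j) \in k} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_k)^2} \right] \tag{2.16}
\end{align}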

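Set $\tau = \frac{1}{\sigma}$ and $u_k = \mu_k$ (to be consistent with the notation in Scott (2001)); then (2.16) becomes

$$
\frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{2\sigma\sqrt{\pi}} - \frac{1}{\sigma q} \sqrt{\frac{2}{\pi}} \sum_{(i,j) \in k} e^{-\frac{1}{2\sigma^2} (x_{ij} - \mu_k)^2} \right]
\tag{2.17}
$$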


distance estimator.

One relationship worth noticing is the inverse relationship between λ and τ. Moreover, the range of λ is bounded from below. To see this, assume λ = 0; then $\hat{U}$ would be X itself, and every single entry $x_{ij}$ of X forms its own bicluster. The variance within a one-entry bicluster is zero. Since τ can be interpreted as the inverse of the standard deviation, τ = +∞ in this case. Therefore, for LEBRA to return a meaningful solution, there exists a lower bound B such that λ ∈ (B, ∞).

2.5 Connection to the M-estimator

From the previous section, we know that the formulation of LEBRA is closely related to the Minimum Distance Estimator. There is another way to gain intuition about the formulation of the LEBRA objective function. Yohai and Zamar (1988) defined a new class of ρ functions satisfying the following six conditions:

1. ρ(0) = 0

2. ρ(−u) =ρ(u)

3. 0≤u≤v implies that ρ(u)≤ρ(v)

4. ρ(t) is continuous

5. Let a = sup ρ(u); then 0 < a < ∞

6. If ρ(u) < a and 0 ≤ u < v, then ρ(u) < ρ(v)

Apparently, the ρ function in the ISE part of the LEBRA objective function satisfies all of the conditions except the first one; we call it a generalized version of the ρ function. Recall the ISE part:

$$
\frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_{ij})^2}
\tag{2.18}
$$


After some algebra, we have:

$$
\frac{\tau}{np} \sum_{i,j} \rho(\tau r_{ij})
\tag{2.19}
$$

where $r_{ij} = x_{ij} - u_{ij}$ and $\rho(t) = \frac{1}{2\sqrt{\pi}} - \sqrt{\frac{2}{\pi}}\, e^{-\frac{1}{2}t^2}$. In our case, $\rho(0) = \frac{1}{2\sqrt{\pi}} - \sqrt{\frac{2}{\pi}}$. Our problem can be viewed as a scaled version of the MM-estimate (Yohai, 1987), which derives from the M-estimate. This idea has been applied in many other fields as a robust alternative. For example, robust ridge regression (RRR) replaces the Euclidean distance with a specific form of the ρ(t) function, along with some constants that need to be computed correspondingly. Besides ridge regression, the Robust LASSO applies the same idea to the original LASSO problem, but using Tukey’s biweight criterion. In every case, the robustness comes from the boundedness of ρ(t). For a detailed explanation, please see the paper on high breakdown-point estimates of regression by Yohai and Zamar (1988).

2.6 Estimation of Biclusters with LEBRA

Block relaxation methods are a class of fixed-point methods with a long and complicated history. For a more detailed discussion of block relaxation methods, please see De Leeuw (1994). In this chapter, we focus only on the application of the block relaxation method to our problem. Recall that the formulation of the LEBRA problem contains two variables, U and τ; we are going to use the block relaxation method to minimize over U and τ alternately.

From now on, X is a fixed n-by-p data matrix with n rows and p columns. The weights W and $\tilde{W}$ depend only on the rows and columns of X; therefore they are fixed as well. The LEBRA criterion can be seen as a bivariate function on $\mathbb{R}^{n \times p} \times \mathbb{R}_{++}$. The block relaxation method requires solving subproblems on $\mathbb{R}^{n \times p}$ and $\mathbb{R}_{++}$, respectively. We now illustrate how to minimize each of them.

Updating τ: Recall that the LEBRA error criterion is a univariate function of τ, given ˜


reasonably large value. Estimating τ amounts to solving the following minimization problem:

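\begin{align}
\hat{\tau} &= \arg\min_{\tau} \frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_{ij})^2} + \lambda J(U) \tag{2.20} \\
&= \arg\min_{\tau} \frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau^2 (x_{ij} - u_{ij})^2} \tag{2.21}
\end{align}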
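Problem (2.21) is a one-dimensional minimization, so a derivative-free line search is one way to sketch it. The snippet below uses a golden-section search over an assumed bracket [0.001, 10]; whether this matches the GSA step of Algorithm 1 is an assumption, and the names `ise` and `golden_section` are illustrative. The residuals are simulated with standard deviation 2, so a value of τ near 1/2 is expected.

```python
import math
import random

def ise(tau, resid):
    """ISE part of the criterion as a function of tau, for fixed residuals x_ij - u_ij."""
    n = len(resid)
    s = sum(math.exp(-0.5 * (tau * r) ** 2) for r in resid)
    return tau / (2.0 * math.sqrt(math.pi)) - tau / n * math.sqrt(2.0 / math.pi) * s

def golden_section(func, lo, hi, tol=1e-6):
    """Golden-section search for a minimizer of func on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc < fd:  # a minimizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:        # a minimizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return 0.5 * (a + b)

rng = random.Random(0)
resid = [rng.gauss(0.0, 2.0) for _ in range(2000)]  # flattened residuals
tau_hat = golden_section(lambda t: ise(t, resid), 1e-3, 10.0)
```

The search only finds a local minimizer inside the bracket, so the bracket should be chosen wide enough to contain plausible values of 1/σ.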

Updating U: The update procedure of LEBRA for U is actually a COBRA problem on “modified data.” COBRA was developed by Chi et al. (2017); it recasts an ordinary biclustering problem as a penalized convex optimization problem. The COBRA error criterion is defined as follows:

$$
\text{COBRA}_\lambda(X) = \arg\min_{U} \frac{1}{2}\|X - U\|_F^2 + \lambda J(U)
\tag{2.22}
$$

where the notation is the same as for LEBRA. COBRA plays an important role in the estimation of LEBRA, as we will see in detail in the later discussion.

For fixed τ, the LEBRA error criterion can be viewed as a function of U alone. Then we have the following proposition:

Proposition 2.6.1 Given fixed τ, the ISE part of the LEBRA error criterion has a bounded Hessian matrix.

The proof of Proposition 2.6.1 will be given in Appendix A.1. Given Proposition 2.6.1, Proposition 2.6.2 tells us how to find a majorant of $F_\lambda$ and how to minimize it. The proofs of both propositions will be given in Section A.1.

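Proposition 2.6.2 Given fixed τ, a majorant of $F_\lambda$, $G_\lambda : \mathbb{R}^{n \times p} \to \mathbb{R}$, has the form:

$$
G_\lambda(U|\tilde{U}) = \frac{\sqrt{2}\,\tau L}{np\sqrt{\pi}} \left[ \frac{1}{2}\|U - X^*\|_F^2 + \gamma(\lambda) J(U) \right] + \tilde{C}
\tag{2.23}
$$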


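where $\tilde{C}$ is a constant not related to either U or τ,

$$
\gamma(\lambda) = \frac{np\sqrt{\pi}}{\tau\sqrt{2}\,L}\,\lambda
\tag{2.24}
$$

$$
X^* = \tilde{U} + \tilde{K}
\tag{2.25}
$$

$$
\tilde{K} = \left\{ \frac{\tau^2}{L} (x_{ij} - \tilde{u}_{ij})\, e^{-\frac{\tau^2 (x_{ij} - \tilde{u}_{ij})^2}{2}} \right\}_{(i,j)} \in \mathbb{R}^{n \times p}
\tag{2.26}
$$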

It turns out that the majorant coincides with the form of the COBRA error criterion. Therefore, minimizing the majorant with respect to U in each LEBRA iteration is just a COBRA biclustering problem over the “modified” data $X^* = \tilde{U} + \tilde{K}$ with tuning parameter γ(λ).

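As for L, we can bound the curvature as follows:

$$
\nabla^2 \left[ -e^{-\frac{\tau^2 (x_{ij} - \cdot)^2}{2}} \right](u_{ij})
= \tau^2 e^{-\frac{\tau^2 (x_{ij} - u_{ij})^2}{2}} \left[ 1 - \tau^2 (u_{ij} - x_{ij})^2 \right]
\le \tau^2 e^{-\frac{\tau^2 (x_{ij} - u_{ij})^2}{2}}
\le \tau^2 = L
$$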

The robustness of LEBRA lies in the formulation of $G_\lambda$, and we will explain this in detail later.

L2E BiclusteRing Algorithm (LEBRA): Now that we have an algorithm for each of the two subproblems, we are ready to state LEBRA:

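Algorithm 1 L2E BiclusteRing Algorithm (LEBRA)

1: procedure LEBRA($U^{(0)}$, $\tau^{(0)}$) ▷ Initialize $U^{(0)}$ and $\tau^{(0)}$

2: repeat

3: $\tau^{(i+1)} = \text{GSA}(\tau^{(i)}, U^{(i)})$

4: $U^{(i+1)} = \text{COBRA}_{\gamma(\lambda)}(U^{(i)})$ ▷ $\gamma(\lambda) = \frac{np\sqrt{\pi}}{\tau^{(i+1)}\sqrt{2}\,L}\,\lambda$

5: until convergence

6: end procedure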
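The alternating structure of Algorithm 1 can be sketched in a deliberately simplified setting. The snippet below assumes a single bicluster, which corresponds to letting the penalty grow without bound, so that the COBRA step on the modified data $X^*$ reduces to taking its mean. This is not the full LEBRA procedure: no COBRA solver is involved, and the names `ise`, `golden_section`, and `lebra_single_bicluster`, the search bracket, and the simulated data are all assumptions for illustration.

```python
import math
import random
import statistics

def ise(tau, resid):
    """ISE part of the criterion for fixed residuals."""
    s = sum(math.exp(-0.5 * (tau * r) ** 2) for r in resid)
    return (tau / (2.0 * math.sqrt(math.pi))
            - tau / len(resid) * math.sqrt(2.0 / math.pi) * s)

def golden_section(func, lo, hi, tol=1e-7):
    """Golden-section search for a minimizer of func on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return 0.5 * (a + b)

def lebra_single_bicluster(xs, n_outer=100, tol=1e-5):
    """Block relaxation over (mu, tau) when U is constrained to be constant."""
    mu, tau = statistics.median(xs), 1.0
    for _ in range(n_outer):
        resid = [x - mu for x in xs]
        # tau-step: one-dimensional minimization of the ISE.
        tau_new = golden_section(lambda t: ise(t, resid), 1e-3, 10.0)
        # U-step: modified data x* = mu + q * (x - mu), using L = tau^2.
        q = [math.exp(-0.5 * (tau_new * r) ** 2) for r in resid]
        x_star = [mu + qi * r for qi, r in zip(q, resid)]
        mu_new = sum(x_star) / len(x_star)  # constant fit to the modified data
        done = abs(mu_new - mu) < tol and abs(tau_new - tau) < tol
        mu, tau = mu_new, tau_new
        if done:
            break
    return mu, tau

rng = random.Random(0)
xs = ([rng.gauss(3.0, 1.0) for _ in range(475)]
      + [rng.gauss(50.0, 1.0) for _ in range(25)])  # 5% mean-shift outliers
mu_hat, tau_hat = lebra_single_bicluster(xs)
```

In this toy setting the estimated centroid stays close to 3 even though the plain average of the data is pulled well above it by the outliers.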


the closer we are to the ground truth, the more LEBRA behaves like an algorithm with a Euclidean loss function. For convenience, let us assume σ = 1. Recall Proposition 2.6.1; under the conditions of Proposition 2.6.1, the error criterion in LEBRA can be written in the form (2.17). Now apply a quadratic Taylor expansion to the term $-e^{-(x_{ij} - u_k)^2/2}$ in (2.17), viewed as a function of $x_{ij}$, at $x_{ij} = u_k$ for $(i, j) \in k$, which gives $-e^{-(x_{ij} - u_k)^2/2} \approx -1 + \frac{1}{2}(x_{ij} - u_k)^2$. With q entries in each bicluster, (2.17) can then be rewritten approximately in the following form:

$$
\frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{2\sqrt{\pi}} - \sqrt{\frac{2}{\pi}} + \frac{1}{2q} \sqrt{\frac{2}{\pi}}\, \|x_k - u_k \mathbf{1}\|_2^2 \right]
\tag{2.27}
$$

where $x_k \in \mathbb{R}^{|k| \times 1}$ collects the entries of bicluster k. (2.27) indicates that LEBRA is equivalent to a scaled Euclidean-distance-based biclustering algorithm when we are close enough to the ground truth. The difference is a higher-order term in the distance between the data matrix and the ground truth. In other words, LEBRA gains strong resistance to outliers at the cost of good properties such as convexity, but its local behavior is similar to that of a scaled COBRA, a convex continuous problem.

2.7 Robustness of LEBRA

The robustness of LEBRA is very intuitive, and it lies in the formulation of $G_\lambda$. We have already mentioned that its robustness comes from the use of the minimum distance estimator. From the algorithmic point of view, we can also understand where the robustness comes from and how LEBRA actually works. After some simple algebra, we can write the quadratic majorant in the following form:

$$
G_\lambda(U|\tilde{U}) = \frac{\tau L}{np\sqrt{2\pi}} \left\| U - \left( \tilde{Q} \circ X + \tilde{U} \circ (\mathbf{1}\mathbf{1}^T - \tilde{Q}) \right) \right\|_F^2 + \lambda J(U)
\tag{2.28}
$$

where $\circ$ denotes element-wise matrix multiplication, $\tilde{Q} = Q(\tilde{U}) = \{\exp(-\frac{\tau^2}{2}(x_{ij} - \tilde{u}_{ij})^2)\}_{i,j} \in$


function of the previous update of U, and $\tilde{Q}$ tends to zero when there is a relatively large disagreement between the data matrix and the previous estimate, while $\tilde{Q}$ is much closer to one when there is perfect agreement between the data matrix and the previous estimate.

In essence, LEBRA is a COBRA that is more dynamic in terms of the data matrix it works on throughout the algorithm. The data matrix that LEBRA works on keeps being adjusted as LEBRA proceeds. LEBRA uses a convex combination of the data matrix X and the previous update U, instead of using X all the way as COBRA does. As LEBRA iterates, it not only dynamically updates, via $\tilde{Q}$, the tradeoff between the real data matrix and the tuning parameter that it is working with, but also adjusts the values of $\tilde{Q}$ itself throughout the whole algorithm.

An outlier at a certain location within a bicluster tends to disagree strongly with the centroid of that bicluster. Such disagreement tends to produce a small value of $\tilde{Q}_{ij}$. A small $\tilde{Q}_{ij}$ indicates that the LEBRA algorithm tends to “trust” the estimate from the algorithm more than the data matrix itself at the i-th row and j-th column. In other words, LEBRA clusters the data using the “important” part instead of all of the data, and the weights control how important each entry is. This idea is very similar to using the median or the trimmed mean as a substitute for the average when estimating the mean. The only difference is that here importance is defined by the Gaussian kernel and is data driven, while the trimmed mean and the median are very subjective.
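The role of $\tilde{Q}$ can be made concrete for a single entry. The sketch below evaluates the weight $\tilde{Q}_{ij}$ and the corresponding entry of the convex combination in (2.28), once for an entry that agrees with the previous estimate and once for an outlying entry. The helper name `modified_entry` and the numbers are illustrative assumptions.

```python
import math

def modified_entry(x, u_prev, tau):
    """Return (q, x_star) with q = exp(-tau^2 (x - u_prev)^2 / 2) and
    x_star = q * x + (1 - q) * u_prev, one entry of Q * X + (1 - Q) * U_prev."""
    q = math.exp(-0.5 * (tau * (x - u_prev)) ** 2)
    return q, q * x + (1.0 - q) * u_prev

# Entry close to the previous estimate: q is near one, so the data value is kept.
q_in, x_star_in = modified_entry(x=2.1, u_prev=2.0, tau=1.0)

# Outlying entry: q is near zero, so the previous estimate is used instead.
q_out, x_star_out = modified_entry(x=12.0, u_prev=2.0, tau=1.0)
```

The outlying value 12 is effectively replaced by the previous estimate 2 before the next COBRA step, which is how the down-weighting described above operates.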

2.8 Convergence of LEBRA

Under a compactness assumption on the LEBRA sequence $\{\theta^{(k)}\}_{k>0} = \{\tau^{(k)}, U^{(k)}\}_{k>0}$, we are able to prove that there exists a convergent subsequence of the LEBRA sequence that is guaranteed to converge to a critical point. The LEBRA convergence theorem is given below:


Ξ and LEBRA mapping φ, assume that a bounded sequence $\{\theta^{(k)}\}_{k=1}^{\infty}$ is generated which satisfies

$$
\theta^{(k+1)} = \varphi(\theta^{(k)})
\tag{2.29}
$$

Then either LEBRA stops at a point where a solution is identified, or there exists a k such that, for all k + j (j ≥ 1), there is a convergent subsequence $\{\theta^{(i_k)}\}_{k=0}^{\infty}$ whose limit is a fixed point of the LEBRA mapping φ.

The proof of Theorem 1 is given in Appendix A.1.

2.9 Model Selection

Not all biased models are better than the unbiased model; the goal of using a penalized model is to sacrifice unbiasedness in exchange for a good reduction in estimation variance. This is why we need an approach that is capable of selecting the best model. Model selection is a complicated problem in general, and clustering and biclustering are no exception. For simplicity, we will use the random hold-out cross-validation method.

Random Hold-out Cross-validation: We randomly select a set of entries from the data matrix and collect their index pairs as a hold-out set. The corresponding entries are held out and imputed with zeros instead. LEBRA is then run for a sequence of tuning parameters, and we choose the tuning parameter that minimizes the prediction error for the entries in the hold-out set. This idea was first proposed by Wold (1978) for model selection in principal component analysis and has been used more recently to select tuning parameters in the matrix completion problem (Mazumder, Hastie, and Tibshirani (2010)) and in the convex biclustering problem as well (Chi et al. (2017)). The difference between LEBRA and COBRA in choosing the tuning parameter is the distance function used: the Euclidean distance is used for COBRA, while the ISE is used for LEBRA. Denote a hold-out index set Θ ⊂ Ω =


Following the convention of Chi et al. (2017), we choose a hold-out set that is small relative to the original data, say 10% of the data. Then we minimize the following criterion over U and τ:

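$$
\frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau^2 \left( P_{\Theta^c}(x_{ij}) - P_{\Theta^c}(u_{ij}) \right)^2} + \lambda J(U)
\tag{2.30}
$$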

for a sequence of λ’s, Λ, ordered from smallest to largest. We choose the λ that minimizes the prediction error over the hold-out set Θ, i.e.,

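$$
\lambda^* = \arg\min_{\lambda \in \Lambda} \frac{\tau_0}{2\sqrt{\pi}} - \frac{\tau_0}{np} \sqrt{\frac{2}{\pi}} \sum_{i,j} e^{-\frac{1}{2}\tau_0^2 \left( P_{\Theta^c}(x_{ij}) - P_{\Theta^c}(u_{ij}) \right)^2}
\tag{2.31}
$$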

where $\tau_0$ is the estimate obtained from solving the hold-out minimization.

Solving the Random Hold-out Cross-validation Problem: The problem of minimizing (2.30) can be viewed as a LEBRA version of the matrix completion problem. It is a LEBRA version because of the difference in ψ: the traditional matrix completion problem uses the Euclidean distance as ψ, while LEBRA uses the Integrated Squared Error. The following pseudo-code summarizes a simple algorithm that reduces the LEBRA version of the matrix completion problem to solving a sequence of LEBRA problems.

Algorithm 2 LEBRA with missing data

1: procedure LEBRA hold out($U^{(0)}$, $\tau^{(0)}$) ▷ Initialize $U^{(0)}$ and $\tau^{(0)}$

2: repeat

3: $M = P_{\Theta^c}(X) + P_{\Theta}(U^{(i)})$

4: $(\tau^{(i+1)}, U^{(i+1)}) = \text{LEBRA}(\tau^{(i)}, M)$

5: until convergence

6: end procedure
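Step 3 of Algorithm 2 can be sketched as follows: observed entries are taken from X, and held-out entries are filled in from the current estimate. The helper name `impute_holdout` and the tiny matrices are illustrative assumptions; the LEBRA call in step 4 is not shown.

```python
def impute_holdout(X, U, holdout):
    """Return M = P_{Theta^c}(X) + P_Theta(U): entries with index pair in
    `holdout` come from the current estimate U, all others from the data X."""
    return [[U[i][j] if (i, j) in holdout else X[i][j]
             for j in range(len(row))]
            for i, row in enumerate(X)]

X = [[1.0, 2.0], [3.0, 4.0]]
U = [[0.5, 0.5], [0.5, 0.5]]  # current estimate
holdout = {(0, 1)}            # index pairs in the hold-out set Theta
M = impute_holdout(X, U, holdout)
```

Here only the held-out entry (0, 1) is replaced, so M keeps the three observed values of X.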


problem for each iteration. This approach is the same as the model selection method described in Chi et al. (2017) and is similar to the soft-impute matrix completion method in Mazumder et al. (2010). The difference is that they use the Euclidean distance, while we use the Integrated Squared Error. For each iteration, the subproblem is actually an MM problem (Lange, Hunter, and Yang (2000)). Please see Web Appendix D in Chi et al. (2017) for the derivation of the algorithm for solving LEBRA with missing values.

2.10 Simulation Studies

In this section, we compare LEBRA with COBRA and Sparse Biclustering (SpBC). Nine measures have been used to assess the quality of the biclustering estimates, but because of length limitations we report only four of them here. The results in terms of the other five measures, along with the four measures shown in this section, can be found in Appendix A.2. The four measures are the Rand Index (RI), the Adjusted Rand Index (ARI), the Variation of Information (VI) and the Normalized Mutual Information (NMI). COBRA is implemented in the R package cvxbiclustr; SpBC is implemented in the R package sparseBC. COBRA solves the biclustering problem by minimizing a convex error criterion, with the Frobenius norm as the measure of distance in the loss function, combined with a penalty term penalizing the pairwise differences among rows and columns, respectively, while the SpBC method minimizes the same loss function as COBRA but imposes the l1 penalty on the number of centroids instead. All parameters are selected according to the documentation of the corresponding R package.


less than log(K), where K is the maximum number of biclusters. The difference between ARI and RI is that ARI not only counts the agreements between two biclustering estimates but also adjusts for agreement due to chance. Both RI and ARI are pair-counting-based measures. The Variation of Information (VI), or Shared Information Distance, is based on the concepts of entropy and information. It is the difference between the information shared by two biclustering estimates and the total amount of information carried by each biclustering estimate. If two biclustering estimates are exactly the same, the shared information equals the information carried by each of them, i.e., VI = 0. Unlike RI and ARI, VI has been proved to be a metric that defines a distance between two biclustering estimates. The triangle inequality and the symmetry of VI are extremely useful when interpreting VI scores. In addition, the advantage of having a bounded metric with a known bound is that we can define what “small” means in terms of VI. Meilă (2007) found that the nearest neighbor of a clustering estimate in terms of the VI metric is the clustering result obtained by splitting or merging small clusters, which is consistent with our intuitive definition of “small.” Therefore, by using VI, we know what counts as a “small” change in a biclustering result. The Normalized Mutual Information (NMI) shares the same idea as VI. The difference between them is that VI defines the distance between the shared information and the total information carried by each estimate, while NMI does the same in terms of the ratio of the shared information to the total information carried by each estimate. Like VI, NMI is a non-negative, information-theory-based measure. If two biclustering estimates agree with each other, the shared information is the same as the information carried by each of them, i.e., NMI = 1, while NMI = 0 if they are completely at odds with each other, i.e., there is no shared information. Thus NMI lies between 0 and 1, like RI. Since the NMI is normalized, we can use it to compare estimates with different numbers of clusters and eliminate the problem that entropy tends to increase with the number of biclusters. A detailed introduction to these measures and the additional five measures can be found in Appendix A.3.
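The two pair-counting measures can be sketched directly from their standard definitions. The snippet below assumes that every matrix entry carries one bicluster label, so that two biclustering estimates are compared as two label vectors of equal length. The function names `rand_index` and `adjusted_rand_index` and the toy labelings are illustrative; this is not the code used to produce the results in this chapter.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(a, b):
    """Fraction of pairs on which the two labelings agree (both same or both different)."""
    pairs = combinations(range(len(a)), 2)
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / comb(len(a), 2)

def adjusted_rand_index(a, b):
    """Rand index corrected for chance, computed from the contingency table.
    Undefined (division by zero) when both labelings are trivial."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]  # same partition, different label names
other = [0, 0, 1, 1, 2, 2]      # a different partition

ri_same, ari_same = rand_index(truth, relabeled), adjusted_rand_index(truth, relabeled)
ri_other, ari_other = rand_index(truth, other), adjusted_rand_index(truth, other)
```

Both measures depend only on the partition and not on the label names, which is why the relabeled vector scores a perfect agreement with the truth.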


the total same size. Suppose the ground truth has four row clusters and four column clusters, sixteen biclusters in total. For each bicluster, we simulated the data from $N(\mu_{rc}, \sigma^2)$, where $\mu_{rc}$ is the centroid of the bicluster located in the r-th row cluster and c-th column cluster, and $\sigma^2$ is the variance within each bicluster. The mean parameters $\mu_{rc}$ can be any numbers that are roughly equally spaced across r and c. In our simulation study, we pick the $\mu_{rc}$ to be distinct integers. The noise level is set to a value in $\sigma^2 \in \{1, 50, 100, 500\}$. The locations of the outliers are chosen at random. The values of the outliers are generated from a uniform distribution $U(4\sigma_0^2, 5\sigma_0^2)$, where $\sigma_0^2$ is the variance of the ground truth plus the variance within each bicluster. Figure 2.2 compares the performance of the methods when there is no contamination, i.e., C = 0. The sizes of the row clusters are 100, 30, 30 and 40, while the sizes of the column clusters are 50, 60, 50 and 40. Without outliers, we can see that Sparse Biclustering, LEBRA and COBRA have the

(53)

same performance levels when the noise level is less than50. As we increased the noise level, the Sparse Biclustering approach did a slightly better job than LEBRA and COBRA did. Moreover, LEBRA is more effective than COBRA when the noise level is at 100 but less effective than SpBC under the same condition. But overall, based on this experiment, Sparse Biclustering is preferred when there are no outliers. A possible reason that Sparse Biclustering beats LEBRA and COBRA might be due to the maximum number of cluster selection parameters we have set in this simulation. The maximum number of row clusters and column clusters are set to be 10 for timing purposes. Theoretically it should be set up to the number of rows and the number

Figure 2.3: Row Size: 10,100,50,40; Column size: 10,100,50,40; No outliers


column clusters are 10, 100, 50 and 40 as well.

When there are outliers, that is, $C \neq 0$, the story is slightly different. Suppose the true underlying checkerboard structure is still a 4-by-4 block structure. The row and column structures are the same as those used in Figure 2.2.
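The contaminated simulation design described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the code used to produce the figures: the function name `simulate_checkerboard`, the choice of the integers 1 to 16 as centroids, the reading of "the variance of the ground truth" as the variance of the entries of the true mean matrix, and the decision to replace each contaminated entry with its uniform draw (rather than add the draw as a shift) are all hypothetical.

```python
import numpy as np

def simulate_checkerboard(row_sizes, col_sizes, sigma2, n_outliers, seed=0):
    """Simulate a checkerboard data matrix with randomly placed outliers."""
    rng = np.random.default_rng(seed)
    n_row_clusters, n_col_clusters = len(row_sizes), len(col_sizes)

    # Assumption: centroids mu_rc are the distinct integers 1, 2, ..., R*C,
    # assigned to the biclusters in random order.
    n_blocks = n_row_clusters * n_col_clusters
    mu = rng.permutation(np.arange(1, n_blocks + 1)).astype(float)
    mu = mu.reshape(n_row_clusters, n_col_clusters)

    # Cluster label of every row and every column.
    row_labels = np.repeat(np.arange(n_row_clusters), row_sizes)
    col_labels = np.repeat(np.arange(n_col_clusters), col_sizes)

    # True mean matrix plus N(0, sigma2) noise within each bicluster.
    mean = mu[np.ix_(row_labels, col_labels)]
    X = mean + rng.normal(0.0, np.sqrt(sigma2), size=mean.shape)

    # Assumption: sigma_0^2 is the variance of the entries of the true mean
    # matrix plus the within-bicluster variance.
    sigma0_sq = mean.var() + sigma2

    # Choose outlier locations at random, without replacement.
    flat_idx = rng.choice(X.size, size=n_outliers, replace=False)
    mask = np.zeros(X.size, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(X.shape)

    # Assumption: each contaminated entry is replaced by a draw from
    # U(4 * sigma_0^2, 5 * sigma_0^2).
    X[mask] = rng.uniform(4.0 * sigma0_sq, 5.0 * sigma0_sq, size=n_outliers)
    return X, row_labels, col_labels, mask, sigma0_sq

# Row and column structure as in Figure 2.2, with 4000 outliers.
X, rows, cols, mask, sigma0_sq = simulate_checkerboard(
    row_sizes=[100, 30, 30, 40],
    col_sizes=[50, 60, 50, 40],
    sigma2=50.0,
    n_outliers=4000,
)
```

Setting `n_outliers=0` gives the uncontaminated case $C = 0$, so the same sketch covers both settings.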

Figure 2.4: Row Size: 100, 30, 30, 40; Column size: 60, 60, 60, 50. The number of outliers: 4000


can be found in Figure 2.5.

Figure 2.5: Row Size: 10, 100, 50, 40; Column size: 10, 100, 50, 40. The number of outliers: 4000

The overall trends in the even-size and uneven-size cases are almost the same, although the exact values differ slightly. COBRA tends to perform worse in the uneven case than in the even case. The results of LEBRA are the same in both cases, and the results of SpBC are not meaningful because the method breaks down when there are outliers.