Reducing microarray data via nonnegative matrix factorization for visualization and clustering analysis

(1)

Reducing microarray data via nonnegative matrix factorization

for visualization and clustering analysis

Weixiang Liu

*

_{, Kehong Yuan, Datian Ye}

Research Center of Biomedical Engineering, Life Science Division, Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China Received 20 June 2007

Available online 23 December 2007

Abstract

In microarray data analysis, each gene expression sample has thousands of genes and reducing such high dimensionality is useful for both visualization and further clustering of samples. Traditional principal component analysis (PCA) is a commonly used method which has problems. Nonnegative Matrix Factorization (NMF) is a new dimension reduction method. In this paper we compare NMF and PCA for dimension reduction. The reduced data is used for visualization, and clustering analysis via k-means on 11 real gene expression datasets. Before the clustering analysis, we apply NMF and PCA for reduction in visualization. The results on one leukemia dataset show that NMF can discover natural clusters and clearly detect one mislabeled sample while PCA cannot. For clustering analysis via k-means, NMF most typically outperforms PCA. Our results demonstrate the superiority of NMF over PCA in reducing microarray data. Ó 2007 Elsevier Inc. All rights reserved.

Keywords: Microarray data; Nonnegative Matrix Factorization; Principal component analysis; Visualization; Clustering analysis

1. Introduction

Microarray data, or gene expression data, plays an

important role in understanding life [1]. Many topics

include revealing natural structures and identifying inter-esting patterns in the underlying data, which is helpful to understand gene function, gene regulation, cellular pro-cesses, subtypes of cells, and so on. However, microarrays always contain thousands of genes with relatively few sam-ples/conditions. In gene expression sample analysis, it is useful to reduce the data for visualization in 2-D or 3-D, and necessary for further clustering and classiﬁcation[1].

Traditional principal component analysis (PCA) is a commonly used method for reducing data for visualization and clustering analysis[2]. However previous results show that PCA seems not suitable for dimension reduction in further clustering gene expression [3]. Recently Nonnega-tive Matrix Factorization (NMF) has been proposed for

nonnegative data analysis. For example, nonnegativity makes the results of face images intuitive visually and

inter-pretable rationally [4]. Now NMF is starting to become

widely used in data analysis; e.g. see[5,6]for more informa-tion and references therein.

In bioinformatics area, NMF has also been proven a powerful method for microarray data analysis and protein sequence analysis[7]. However, in[8]NMF is used for clus-tering genes and in[9]a modiﬁed algorithm based on NMF is used for the same task; in[10]NMF is used for protein sequence recognition; in[11,12]NMF is only directly used for clustering gene expression samples with compared to hierarchical clustering and self-organizing maps; in [13]

NMF is used for biclustering of gene expression data. There are no results directly comparing PCA and NMF for reducing microarray data in visualization and cluster-ing analysis. In this paper, we compare PCA and NMF for dimension reduction of microarray gene expression data in visualization and clustering of samples with k-means method. Our experimental results show the superiority of NMF over PCA on microarray data. 1532-0464/$ - see front matterÓ 2007 Elsevier Inc. All rights reserved.

doi:10.1016/j.jbi.2007.12.003 *

Corresponding author.

E-mail address:[email protected](W. Liu).

www.elsevier.com/locate/yjbin Journal of Biomedical Informatics 41 (2008) 602–606

(2)

The rest of this paper is organized as below. In Section2

we brieﬂy introduce NMF and PCA and discuss their dif-ferences in microarray data analysis. After some experi-mental results shown in Section3, we conclude in Section

4.

2. NMF vs PCA for Reducing Microarray Data

Generally speaking, given a microarray dataset with n genes in m samples, there are two important aspects: n >> mholds while m is usually smaller than one hundred, and the expression value is always positive. In this section, we ﬁrst introduce PCA and NMF, and then discuss their diﬀerences in reducing microarray data.

2.1. PCA

Here we recall basic theory for PCA and discuss two methods for computing the principal components. See book [2] for more details and recent introductory article

on PCA and SVD[14].

Let X ¼ ½x1. . . xm denote a data set with m samples in

Rn_{. Further assume that all samples are zero mean, and this}

can be done by subtracting the mean value from the sam-ples, i.e. xi xi l where l ¼_m1P

m

i¼1xi. The empirical

covariance matrix of the sample set is

C¼ 1

mXX

T_: _ð1Þ

And this can be written, according to single value decom-position theorem, as

C¼ U KUT _ð2Þ

where K¼ diagðk1; . . . ;knÞ is the diagonal matrix with

or-dered eigenvalues k1P k2P . . . P kn, and U¼ ½u1. . . un

is the matrix whose columns are corresponding eigenvec-tors. For dimension reduction, we choose the ﬁrst rðr < nÞ eigenvectors according to the ﬁrst largest r eigen-values, and get the new feature vectors of one sample x as z¼ UT

rðx lÞ: ð3Þ

However, if m > n, it is time-consuming to get the eigenvec-tors ui from the large empirical covariance matrix. One

alternative way [14] is to ﬁrst calculate the inner product matrix

S ¼1

mX

T_{X ;} _ð4Þ

and it can be decomposed as

S ¼ V K0VT ð5Þ

where K0¼ diagðk01P k 0

2P . . . P k 0

mÞ is the diagonal

ma-trix with ordered eigenvalues k0₁P k0₂P . . . P k0_m. Then the original eigenvectors can be derived as

ui¼

1 ffiffiffiffiffiffiffi mk0_i

p Xvi; for k0i6¼ 0: ð6Þ

In fact, all nonzero eigenvalues k0_iare same to those kiwith

same order.

With this eigenvalue decomposition method, the num-ber of components via PCA always is limited by the mini-mum number between n and m.

2.2. NMF

Here we brieﬂy introduce the basic idea of NMF. Given a data set in matrix form X2Rnm

þ containing m samples

in n-dimension space, in which each of entry is nonnegative (i.e. xij P0 for all i; j), NMF is to ﬁnd an approximation as

X BH ð7Þ

where B is an n d matrix and H a d m matrix, and both B; H are also nonnegative. Each column of B can be sidered as a basis vector, and each column of H can be con-sidered as a new feature vector corresponding to the original data. See[4]for more information.

Basically, there exist two algorithms for completing the decomposition via implementing multiplicative updates as below[15] bik bik ðXHT_Þ ik ðBHHT_Þ ik hkj hkj ðBT_XÞ kj ðBT_BHÞ kj ð8Þ Table 1

Preprocessing strategy of datasets

Dataset no. Dataset name Value of

Floor Ceiling Max/min Max–min Std.

1 11_Tumors 100 16,000 5 500 N 2 14_Tumors 100 16,000 5 500 300 3 9_Tumors 100 16,000 N N N 4 Brain_Tumor1 20 16,000 5 500 N 5 Brain_Tumor2 20 16,000 4 500 N 6 Leukemia1 20 16,000 5 500 N 7 Leukemia2 100 16,000 5 500 500 8 Lung_Cancer 0 N N N 50 9 SRBCT N N N N N 10 Prostate_Tumor 20 16,000 5 N N 11 DLBCL 20 16,000 3 100 N

(3)

and bik bik P jhkj xij ðBHÞij P jhkj hkj hkj P ibik xij ðBHÞij P ibik : ð9Þ

These multiplicative update rules can hold nonnegativity easily with nonnegative initialization.

2.3. NMF vs PCA for reducing microarray data

Given a gene expression data, both NMF and PCA can give a solution for dimension reduction. However there are some diﬀerences between them:

For nonnegative data, the result of NMF provides easer interpretation than that of PCA [4]. In gene expression data analysis, the basis of PCA, i.e. the principal compo-nent containing negative values, has no physically inter-pretable meaning. Such a basis can be called an ‘‘eigengene”. However, the basis of NMF holding non-negativity, as called ‘‘metagene”, or ‘‘basis experiment”, reﬂects the coexpressed levels of genes. See [11,9,13] for more details.

Microarray data always contains values from thousands of genes ðnÞ in a few conditions ðmÞ, so n m holds. For PCA, the number of principal components is limited by the number of samples using the eigenvalue

decom-position method; while for NMF, the number of learned basis experiments is not so limited. It appears that NMF can derived more features than samples for further tering analysis, and this may be why it got higher clus-tering accuracy as shown in the experimental results. From computation point, we can see that PCA

essen-tially depends on empirical covariance matrix in which a large number of samples are necessary. However, in

Table 2

Cancer related human gene expression datasets Dataset no. Dataset name Number of

Types Samples Selected genes

1 11_Tumors 11 174 5797 2 14_Tumors 26 308 5304 3 9_Tumors 9 60 5726 4 Brain_Tumor1 5 90 5905 5 Brain_Tumor2 4 50 5607 6 Leukemia1 2/3 72 5230 7 Leukemia2 3 72 5420 8 Lung_Cancer 5 203 3312 9 SRBCT 4 83 2308 10 Prostate_Tumor 2 102 3722 11 DLBCL 2 77 5431

The dataset 6 (leukemia1) can be divided into two or three types. We consider the case of 2 groups for visualization and the case of 3 groups for clustering. –4 –2 0 2 4 6 8 x 104 –6 –4 –2 0 2 4 6 8x 10 4 PC1 PC2 Result of PCA Sample 66 1–ALL 2–AML 0 1 2 3 4 5 6 x 106 0 1 2 3 4 5 6 7x 10 6 BE1 BE2 Result of NMF Sample 66 1–ALL 2–AML

Fig. 1. Results of PCA vs NMF for reducing the leukemia data with 72 samples in visualization. Sample 66 is mislabeled. However in 2-D display, the reduced data by NMF can clearly show this mistake while that by PCA cannot demonstrate the wrong. ‘PC’ stands for principal component and ‘BE’ means basis experiment.

(4)

gene expression data, the number of samples is (always smaller than 100), much smaller than the number of genes (in thousand scale).

Finally, PCA is deterministic while NMF is stochastic; many variants of NMF have been proposed; see refer-ences in main text.

In summary, NMF appears to be more suitable for gene expression sample analysis than PCA; and our experimen-tal results on 11 real gene expression datasets also demon-strate the superiority of NMF.

3. Experimental results

In this section we tested NMF for reducing microarray data in visualization and clustering analysis via k-means approach. In following experiments, the iteration number for each run of NMF is set to 200 for convergence. 3.1. Datasets and preprocessing

We compared the discussed methods on 11 real cancer related gene expression data sets. All datasets in Matlab

format are available from http://www.gems-system.org/;

see [16] and references therein for more information. It

should be noted that all datasets but the SRBCT dataset, are produced by oligonucleotide-based technology. The dataset SRBCT was obtained via two-channel cDNA platform.

In order to reduce computation complexity, we further removed the genes which varied little across samples. See

Table 1for the strategy of preprocessing. Based on the ori-ginal preprocessing strategy of each dataset, we removed genes which vary little across samples, so that the number of genes remaining for each dataset is smaller than 6000. For each dataset, expression levels below the value of Floor were assigned a value of Floor, and those exceeding the value of Ceiling were assigned a value of Ceiling. After this ﬂoor and ceiling preprocessing, data is nonnegative and can be used for both PCA and NMF. The maximum and minimum values of a gene across samples were Max and Min. Max/Min means the ratio of Max to Min, and Max–Min means the diﬀerence of Max minus Min. Std. means the standard deviation value of a gene across sam-ples. Genes varying little across samples were removed according to (at most) three indexes: Max/Min, Max– Min, and Std. Furthermore, the number of groups, samples and selected gene for each dataset is shown inTable 2. For simplicity, we will use the dataset number instead of its name for further discussion.

0 50 100 150 200 0.6 0.8 1 Dimensionality Acuracy dataset–1 0 100 200 300 400 0 0.5 1 Dimensionality Acuracy dataset–2 0 20 40 60 80 0.4 0.6 0.8 Dimensionality Acuracy dataset–3 0 50 100 150 0.6 0.8 1 Dimensionality Acuracy dataset–4 0 20 40 60 80 0.7 0.8 0.9 1 Dimensionality Acuracy dataset–5 0 50 100 0.7 0.8 0.9 1 Dimensionality Acuracy dataset–6 0 50 100 0.8 0.85 0.9 Dimensionality Acuracy dataset–7 0 50 100 150 200 250 0.6 0.8 1 Dimensionality Acuracy dataset–8 0 50 100 0 0.5 1 Dimensionality Acuracy dataset–9 0 50 100 150 0.85 0.9 0.95 Dimensionality Acuracy dataset–10 0 50 100 0.7 0.8 0.9 Dimensionality Acuracy dataset–11 PCA NMF

(5)

3.2. Visualization

Here we consider the leukemia1 dataset (No. 6) which contains two groups: 47 ALL samples and 25 AML sam-ples. In fact, there is a mislabeled sample (No. 66) should

be an ALL sample; see e.g. [17]. As shown in Fig. 1,

NMF can easily detect the wrong labeled sample 66 while PCA cannot. In addition, the reduced data by NMF is more clearly separated than that by PCA.

3.3. Clustering analysis through k-means

Next we considered cluster analysis on each dataset using the reduced features, PCA scores or NMF scores. We applied the standard k-means method with Matlab via PCA and NMF. The distance measure for k-means is squared Euclidean distance. For PCA, the values of dimen-sionality for each dataset are set to f5; 10; 15; . . . ; ½m; mg

where m is the number of samples and½m means the

max-imum integer which can be divided by 5 and smaller than

m; when½m ¼ m, only one is used. For NMF, the reduced

dimensionality can be larger than the sample number, and then the values of dimensionality for each dataset are set to f5; 10; 15; . . . ; ½m; . . . ; ½m þ 20g. The average results over 100 repeated runs of k-means for each dimensionality are shown inFig. 2. And we also listed the mean accuracy over dimensionality on each dataset inTable 3. We can see that NMF almost always outperforms PCA, and this is consis-tent with previous report in[3]. In addition, NMF can get higher clustering accuracy for the situation with more fea-tures than the number of samples.

4. Discussion and conclusions

We compared NMF and PCA for reducing microarray data in visualization and clustering analysis through k-means method. NMF has two obvious advantages over PCA in microarray gene expression data analysis: (i) NMF holds nonnegativity of gene expression data, and (ii) NMF can derived features more than the number of samples. Our results on one leukemia dataset show that NMF can easily check the mislabeled sample via visualiza-tion. And the clustering results via k-means approach on 11 real gene expression datasets demonstrate that NMF out-performs PCA signiﬁcantly. Our results recommend that NMF can be used as dimension reduction for visualization and clustering analysis with contrast to PCA.

Acknowledgments

This paper was part of a project supported by Postdoc-toral Fund of China (No. 20070410536) and National Nat-ural Science Foundation of China (No. 30700824). The clustering accuracy is calculated based on some code of

Prof. Frank Dellaert’s clustering toolbox (http://

www.cc.gatech.edu/~dellaert/clusters.zip). The authors thank insightful comments from editor and both reviewers for improvement of the paper.

References

[1] Baldi P, Hatﬁeld G. DNA Microarrays and Gene Expression: from experiments to data analysis and modeling. Cambridge University Press; 2002.

[2] Jolliﬀe I. Principal component analysis. 2nd ed. Springer; 2002. [3] Yeung K, Ruzzo W. Principal component analysis for clustering gene

expression data. Bioinformatics 2001;17(9):763–74.

[4] Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401:788–91.

[5] Liu W, Zheng N, You Q. Nonnegative Matrix Factorization and its applications in pattern recognition. Chinese Sci Bull 2006;51(17–18):7–18.

[6] Langville A, Berry M, Browne M, Pauca V, Plemmons R. A survey of algorithms and applications for the Nonnegative Matrix Factoriza-tion, submitted to computational statistics and data analysis. URL http://www.cofc.edu/langvillea/CSDA.pdf.

[7] Pascual-Montano A, Carmona-Saez P, Chagoyen M, Tirado F, Carazo J, Pascual-Marqui R. bioNMF: a versatile tool for non-negative matrix factorization in biology. BMC Bioinform 2006;7(1):366.

[8] Kim P, Tidor B. Subsystem identiﬁcation through dimensionality reduction of large-scale gene expression data. Genome Res 2003;13:1706–18.

[9] Wang GL, Kossenkov AV, Michael FO. LS-NMF: a modiﬁed non-negative matrix factorization algorithm utilizing uncertainty esti-mates. BMC Bioinform 2006;7:175–96.

[10] Heger A, Holm L. Sensitive pattern discovery with ‘fuzzy’ alignments of distantly related proteins. Bioinformatics 2003;19(90001):130–7. [11] Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and

molecular pattern discovery using matrix factorization. Proc Natl Acad Sci USA 2004;101(12):4164–9.

[12] Gao Y, Church G. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 2005;21:3970–5.

[13] Carmona-Saez P, Pascual-Marqui RD, Tirado F, et al. Biclustering of gene expression data by non-smooth non-negative matrix factor-ization. BMC Bioinform 2006;7:78–95.

[14] Wall M, Rechtsteiner A, Rocha L. Singular value decomposition and principal component analysis. In: Berrar MGDP, Dubitzky W, editors. A practical approach to microarray data analysis. Norwell, MA: Kluwer; 2003. p. 91–109.

[15] Lee DD, Seung HS. Algorithms for nonnegative matrix factorization. In: Leen TDT, Tresp V, editors. Advances in Neural Information Processing Systems, vol. 13. Cambridge, MA: MIT Press; 2000. p. 556–62.

[16] Statnikov A, Aliferis CF, Tsamardinos I, et al. A comprehensive evaluation of multicategory classiﬁcation methods for microarray gene expression cancer diagnosis. Bioinformatics 2006;21(5):631–43. [17] Malossini A, Blanzieri E, Ng R. Detecting potential labeling errors

in microarrays by data perturbation. Bioinformatics 2006;22(17):2114.

Table 3

Mean accuracy (%) over dimensionality on each real gene expression dataset

Method 1 2 3 4 5 6 7 8 9 10 11

PCA 70 40 51 56 70 79 85 60 55 85 65