CHAPTER 3: BICLUSTERING USING SPARSE CLUSTERING AND

3.4 Real Data Application

3.4.2 Analysis of a Gene Expression Data Set

The data set used in this section contains gene expression measurements on 4751 genes

from tissue samples from a total number of 78 breast cancer patients. The time to metastases

of each subject is also available. See van’t Veer et al. (2002) for a more detailed description

of this data set.

The first biclusters identified by the SCBiclust and LAS algorithms both contain exactly
the same 16 observations, but the SCBiclust_β bicluster has 8 features whereas the LAS
bicluster has 1421 features (Table 3.15). (SCBiclust_U found 3 of the same features as
SCBiclust_β and one additional feature.) The primary bicluster identified by the sparse

biclustering method contains 60 observations and 553 features. The HSSVD method

identified 8 mean bicluster layers and 3 variance bicluster layers, for which we will only

study the primary mean layer. The Plaid method failed to identify any biclusters within the

data set, and the SSVD method and the HSSVD variance identification did not produce

valid biclusters.

We tested the null hypothesis of no association between each putative bicluster and

metastases using log-rank tests. Table 3.15 shows the associations between metastases and the

biclusters identified by SCBiclust, HSSVD (mean layer only), LAS, and sparse biclustering.

A Kaplan-Meier plot is provided in Figure 3.19. The putative biclusters identified by

SCBiclust, LAS, and sparse biclustering were associated with time to metastases, but the

putative bicluster identified by HSSVD mean was not.
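The log-rank comparison described above can be illustrated with a minimal Python sketch. It is not the code used for this analysis: the function names (`logrank_test`, `chi2_sf_df1`) and the toy data are hypothetical, and the sketch assumes a two-group comparison (putative bicluster versus the remaining subjects) with right-censored follow-up times.

```python
import math


def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def logrank_test(times, events, in_bicluster):
    """Two-group log-rank (score) test.

    times        -- follow-up time for each subject
    events       -- True if the event was observed, False if censored
    in_bicluster -- True if the subject belongs to the putative bicluster
    Returns (chi-square statistic, p value) on 1 degree of freedom.
    """
    subjects = list(zip(times, events, in_bicluster))
    event_times = sorted({t for t, e, _ in subjects if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [s for s in subjects if s[0] >= t]
        n = len(at_risk)                                  # subjects at risk
        n1 = sum(1 for s in at_risk if s[2])              # at risk, in bicluster
        d = sum(1 for s in at_risk if s[1] and s[0] == t)  # events at time t
        d1 = sum(1 for s in at_risk if s[1] and s[0] == t and s[2])
        obs_minus_exp += d1 - d * n1 / n                  # observed minus expected
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("log-rank statistic is undefined: zero variance")
    statistic = obs_minus_exp ** 2 / variance
    return statistic, chi2_sf_df1(statistic)


# Hypothetical toy data: four subjects, all with observed events.
toy_stat, toy_p = logrank_test(
    times=[1, 2, 3, 4],
    events=[True, True, True, True],
    in_bicluster=[True, True, False, False],
)
print(round(toy_stat, 3), round(toy_p, 3))
```

As a consistency check on Table 3.15, `chi2_sf_df1(11.11)` rounds to 0.0009 and `chi2_sf_df1(10.20)` rounds to 0.0014, in agreement with the reported p values.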

Table 3.15: Gene expression: Comparison of biclustering and survival analysis results.

                                         Score (log-rank) test
Algorithm             Obs.   Features   Statistic (df)    p value
SCBiclust_β           16     8          11.11 (df=1)      0.0009
HSSVD mean            75     1046       0.42 (df=1)       0.5150
LAS                   16     1421       11.11 (df=1)      0.0009
Sparse Biclustering   60     553        10.20 (df=1)      0.0014

[Figure 3.19: three Kaplan-Meier panels, each comparing the bicluster ("BiCluster") with the reference group ("Ref."); horizontal axis 0 to 150, vertical axis 0.0 to 1.0.
Panel "Kaplan-Meier: SC-Biclust, LAS": hazard ratio 3.27, log-rank test p = 9e-04.
Panel "Kaplan-Meier: sparseBC": hazard ratio 0.33, log-rank test p = 0.0014.
Panel "Kaplan-Meier: HSSVD mean": hazard ratio 0.62, log-rank test p = 0.5148.]

Figure 3.19: Breast cancer gene expression Kaplan-Meier plots. The Kaplan-Meier plots show the
association between time to metastases (months) and the biclusters identified by SCBiclust, LAS,
sparse biclustering, and HSSVD mean.
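For readers unfamiliar with the curves summarized in Figure 3.19, the following is a minimal sketch of the Kaplan-Meier (product-limit) estimator for one group of subjects. The function name `kaplan_meier` and the toy data are hypothetical; this is not the plotting code used to produce the figure.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  -- follow-up time for each subject
    events -- True if the event was observed, False if censored
    Returns a list of (event time, estimated survival just after that time).
    """
    survival = 1.0
    curve = []
    for t in sorted({u for u, e in zip(times, events) if e}):
        n_at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - n_events / n_at_risk  # conditional survival at t
        curve.append((t, survival))
    return curve


# Hypothetical toy data: the second subject is censored at time 2.
for time, surv in kaplan_meier([1, 2, 3], [True, False, True]):
    print(time, round(surv, 3))
```

Computing one such curve for the bicluster and one for the reference group gives the two step functions that are overlaid in each panel.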

3.5 Discussion

Biclustering is an unsupervised learning method that can be useful for uncovering

underlying data patterns in HDLSS data. In addition to identifying clusters of observations,

features responsible for the clusters are also identified. Uncovering the features responsible

for clustering may be especially important if one wishes to group additional data into pre-

identified clusters. In this paper we have proposed a biclustering method which extends

sparse clustering (Witten and Tibshirani, 2010) to also identify distinguishing features. The

method does not place any distributional constraints on the data or clusters and can be

used to identify both mean-based biclusters and more complex structures identified through

hierarchical clustering.

In simulation studies and real data analysis we illustrate that the proposed method

compares favorably with existing methods. SCBiclust tends to correctly identify biclusters

with high feature and observation accuracy. Also, unlike some biclustering methods such

as Plaid (Lazzeroni and Owen, 2002) and sparse biclustering (Tan and Witten, 2014), the

proposed method does not hinge upon the assumption that biclusters have the same mean.

We have shown in the hierarchical clustering example given in Simulation 5 that the method

can be adapted to incorporate other methods for identifying clusters. All that is required for

SCBiclust is a function which increases as a measure of the distance between the biclusters

and the remaining observations grows, and a method for maximizing this function with

respect to the observations.
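As an illustration of the kind of function this requirement describes, the sketch below computes a feature-weighted between-cluster sum of squares for a two-group split of the observations, in the spirit of the criterion used in sparse clustering. It is an assumption-laden example rather than the SCBiclust implementation: the function name `weighted_bcss`, its arguments, and the toy data are hypothetical.

```python
def weighted_bcss(data, in_bicluster, weights):
    """Feature-weighted between-cluster sum of squares for a two-group split.

    data         -- list of observations, each a list of feature values
    in_bicluster -- True for observations assigned to the bicluster
    weights      -- one non-negative weight per feature
    """
    inside = [row for row, m in zip(data, in_bicluster) if m]
    outside = [row for row, m in zip(data, in_bicluster) if not m]
    if not inside or not outside:
        raise ValueError("both groups must contain at least one observation")
    total = 0.0
    for j, w in enumerate(weights):
        overall = sum(row[j] for row in data) / len(data)
        mean_in = sum(row[j] for row in inside) / len(inside)
        mean_out = sum(row[j] for row in outside) / len(outside)
        bcss = (len(inside) * (mean_in - overall) ** 2
                + len(outside) * (mean_out - overall) ** 2)
        total += w * bcss
    return total


# Hypothetical toy data: the criterion grows as the bicluster mean moves away.
membership = [False, False, True, True]
near = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
far = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
print(weighted_bcss(near, membership, [1.0, 1.0]))
print(weighted_bcss(far, membership, [1.0, 1.0]))
```

A search over `in_bicluster` that maximizes this quantity for fixed weights would play the role of the "method for maximizing this function with respect to the observations"; a feature with zero weight does not influence the split.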

We have proposed two ways of generating null weights for SCBiclust. SCBiclust_β
makes use of distributional assumptions about the feature weights, assuming that the
between-cluster sums of squares for the features are uncorrelated. The other method,
SCBiclust_U, generates unimodal null data and determines feature weights from clustering
this null data. We find that both methods generally produce comparable results. SCBiclust_U
may be more suited to finding biclusters in non-normal settings, but may be more restrictive
about feature selection. Using the weights produced by SCBiclust_β greatly reduces the
computation time. If possible, we recommend using both methods and comparing the results
to identify biclusters in a data set.

Finally, SCBiclust can be modified to incorporate any cluster significance testing
method as a stopping criterion for bicluster identification. Evaluating the number
of clusters, or biclusters, present in the data is an ongoing field of research, so having a
method that can adapt to advances in that research is important. Currently we iteratively
employ the SigClust algorithm (Liu et al., 2008) to test the significance of each putative
bicluster. We chose this method for the present paper because of its accuracy in many
situations and its relatively short computation time. In our simulation studies we found
that the stopping criterion used by SCBiclust was generally more accurate than those
used by the other biclustering methods, but it may identify slightly more biclusters than are
present in the data. A future area of research includes modifying the criteria for generating
feature weights to also include a non-parametric test for cluster significance.
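The idea of a pluggable significance test acting as a stopping rule can be sketched as follows. This is a schematic outline under stated assumptions, not the SCBiclust code: `find_bicluster`, `p_value`, and `remove_signal` are hypothetical callables supplied by the user, the significance level `alpha` and the cap `max_biclusters` are illustrative, and how the data are updated between iterations is left entirely to `remove_signal`.

```python
def sequential_biclusters(data, find_bicluster, p_value, remove_signal,
                          alpha=0.05, max_biclusters=10):
    """Collect putative biclusters until one fails the significance test.

    find_bicluster -- hypothetical: data -> putative bicluster
    p_value        -- hypothetical: (data, bicluster) -> p value of any
                      cluster significance test (for example a SigClust-type test)
    remove_signal  -- hypothetical: (data, bicluster) -> data for the next pass
    """
    found = []
    current = data
    for _ in range(max_biclusters):
        candidate = find_bicluster(current)
        if p_value(current, candidate) >= alpha:
            break  # candidate is not significant: stop searching
        found.append(candidate)
        current = remove_signal(current, candidate)
    return found


# Hypothetical stubs: the third candidate fails the test, so two are kept.
stub_p_values = iter([0.01, 0.02, 0.50])
kept = sequential_biclusters(
    data=[[0.0]],
    find_bicluster=lambda d: ([0], [0]),
    p_value=lambda d, b: next(stub_p_values),
    remove_signal=lambda d, b: d,
)
print(len(kept))
```

Because the test enters only through `p_value`, replacing it with a different cluster significance method would not require changes to the loop itself.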

In this paper we have shown that SCBiclust performs well in terms of bicluster

identification and reproducibility in both simulation and real data. SCBiclust is able to

identify both biclusters that differ from the rest of the data in terms of feature means and

other complex structures that can be identified through hierarchical clustering. Future work

includes extending the method to identify biclusters that differ based on feature variance,

perhaps by extending the method to SVD-based approaches. An additional avenue of

future research includes identifying network-based biclusters such that observations in the

biclusters are more correlated than observations outside of the biclusters. These future

advances could further extend the application of SCBiclust in identifying subgroups and

distinguishing features in more complicated HDLSS data.
