CHAPTER 3: BICLUSTERING USING SPARSE CLUSTERING AND
3.4 Real Data Application
3.4.2 Analysis of a Gene Expression Data Set
The data set used in this section contains gene expression measurements on 4751 genes
from tissue samples of 78 breast cancer patients. The time to metastases
of each subject is also available. See van’t Veer et al. (2002) for a more detailed description
of this data set.
The first biclusters identified by the SCBiclust and LAS algorithms both contain exactly
the same 16 observations, but the SCBiclust_β bicluster has 8 features whereas the LAS
bicluster has 1421 features (Table 3.15). (SCBiclust_U found 3 of the same features as
SCBiclust_β and one additional feature.) The primary bicluster identified by the sparse
biclustering method contains 60 observations and 553 features. The HSSVD method
identified 8 mean bicluster layers and 3 variance bicluster layers, of which we study
only the primary mean layer. The Plaid method failed to identify any biclusters within the
data set, and the SSVD method and the HSSVD variance identification did not produce
valid biclusters.
We tested the null hypothesis of no association between each putative bicluster and
metastases using log-rank tests. Table 3.15 shows the associations between metastases and the
biclusters identified by SCBiclust, HSSVD (mean layer only), LAS, and sparse biclustering.
Kaplan-Meier plots are provided in Figure 3.19. The putative biclusters identified by
SCBiclust, LAS, and sparse biclustering were associated with time to metastases (p ≤ 0.0014), but the
putative bicluster identified by HSSVD mean was not (p = 0.5150).
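The two-group log-rank (score) statistic used for these tests can be sketched as follows. This is a minimal illustration of the standard textbook statistic, not the code used for the analysis; the function name `logrank_statistic` and the toy follow-up times are hypothetical and are not taken from the breast cancer data.

```python
from collections import defaultdict


def logrank_statistic(times, events, groups):
    """Two-group log-rank (score) chi-square statistic with 1 degree of freedom.

    times  : follow-up time for each subject
    events : 1 if the event (here, metastasis) was observed, 0 if censored
    groups : 1 for subjects inside the putative bicluster, 0 otherwise
    """
    groups = list(groups)
    # Count events and subjects leaving the risk set at each distinct time.
    deaths = defaultdict(lambda: [0, 0])   # time -> [events in group 0, group 1]
    leaving = defaultdict(lambda: [0, 0])  # time -> [subjects leaving, per group]
    for t, e, g in zip(times, events, groups):
        leaving[t][g] += 1
        if e:
            deaths[t][g] += 1

    at_risk = [groups.count(0), groups.count(1)]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted(leaving):
        n = at_risk[0] + at_risk[1]
        d = deaths[t][0] + deaths[t][1]
        if d > 0 and n > 1:
            # Expected group-1 events under the null of no association.
            expected_1 = d * at_risk[1] / n
            observed_minus_expected += deaths[t][1] - expected_1
            # Hypergeometric variance of the group-1 event count.
            variance += d * (at_risk[0] / n) * (at_risk[1] / n) * (n - d) / (n - 1)
        at_risk[0] -= leaving[t][0]
        at_risk[1] -= leaving[t][1]
    return observed_minus_expected ** 2 / variance


# Hypothetical toy data: the two "bicluster" subjects have the earliest events.
toy_times = [1, 2, 3, 4]
toy_events = [1, 1, 1, 1]
toy_groups = [1, 1, 0, 0]
print(round(logrank_statistic(toy_times, toy_events, toy_groups), 4))  # 49/17, about 2.8824
```

The statistic is symmetric in the group labels, so it does not matter which group is coded as the bicluster. The sketch assumes both groups are non-empty and that at least one event occurs while both groups are at risk; otherwise the variance is zero.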
Table 3.15: Gene expression: Comparison of biclustering and survival analysis results.

Algorithm             Obs.   Features   Score (log-rank) test
                                        Statistic (df)   p value
SCBiclust_β           16     8          11.11 (df=1)     0.0009
HSSVD mean            75     1046       0.42 (df=1)      0.5150
LAS                   16     1421       11.11 (df=1)     0.0009
Sparse Biclustering   60     553        10.20 (df=1)     0.0014
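As a quick arithmetic check, the p values in Table 3.15 can be recovered by referring each statistic to a chi-square distribution with 1 degree of freedom. The sketch below uses only the standard library; the helper name `chi2_sf_1df` is hypothetical, and small differences from the printed p values are expected because the statistics in the table are rounded to two decimals.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    If Z is standard normal, Z**2 is chi-square(1), so
    P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics as printed in Table 3.15 (already rounded to two decimals).
table_statistics = {
    "SCBiclust_beta": 11.11,
    "HSSVD mean": 0.42,
    "LAS": 11.11,
    "Sparse Biclustering": 10.20,
}

for name, stat in table_statistics.items():
    print(f"{name}: p = {chi2_sf_1df(stat):.4f}")
```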
[Figure 3.19: three Kaplan-Meier panels, each with x-axis ticks from 0 to 150, y-axis ticks from 0.0 to 1.0, and curves labeled "BiCluster" and "Ref."]
Panel "Kaplan-Meier: SC-Biclust, LAS": hazard ratio 3.27; log-rank test p = 9e-04.
Panel "Kaplan-Meier: sparseBC": hazard ratio 0.33; log-rank test p = 0.0014.
Panel "Kaplan-Meier: HSSVD mean": hazard ratio 0.62; log-rank test p = 0.5148.
Figure 3.19: Breast cancer gene expression Kaplan-Meier plot. Kaplan-Meier plots showing the
association between time to metastases (months) and the biclusters identified by SCBiclust, LAS, sparse
biclustering, and HSSVD mean.
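For reference, the Kaplan-Meier (product-limit) estimate behind curves of this kind can be sketched in a few lines. This is a generic illustration under stated assumptions, not the plotting code used for Figure 3.19; the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (t, S(t)) at each distinct event time.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    n_at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        d = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if d > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after time t.
        n_at_risk -= leaving
    return curve


# Hypothetical toy data: the subject with time 2 is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Applying the function separately to the subjects inside and outside a putative bicluster gives the two step functions that a panel of this type compares.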
3.5 Discussion
Biclustering is an unsupervised learning method that can be useful for uncovering
underlying data patterns in HDLSS data. In addition to identifying clusters of observations,
features responsible for the clusters are also identified. Uncovering the features responsible
for clustering may be especially important if one wishes to group additional data into pre-
identified clusters. In this paper we have proposed a biclustering method which extends
sparse clustering (Witten and Tibshirani, 2010) to also identify distinguishing features. The
method does not place any distributional constraints on the data or clusters and can be
used to identify both mean-based biclusters and more complex structures identified through
hierarchical clustering.
In simulation studies and real data analysis we illustrate that the proposed method
compares favorably with existing methods. SCBiclust tends to correctly identify biclusters
with high feature and observation accuracy. Also, unlike some biclustering methods such
as Plaid (Lazzeroni and Owen, 2002) and sparse biclustering (Tan and Witten, 2014), the
proposed method does not hinge upon the assumption that biclusters have the same mean.
We have shown in the hierarchical clustering example given in Simulation 5 that the method
can be adapted to incorporate other methods for identifying clusters. All that is required for
SCBiclust is a function which increases as a measure of the distance between the biclusters
and the remaining observations grows, and a method for maximizing this function with
respect to the observations.
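To make the two requirements concrete, the sketch below pairs one possible objective with one possible maximizer. It assumes a weighted between-cluster sum of squares for a two-group split, in the spirit of sparse clustering (Witten and Tibshirani, 2010); the exact criterion used by SCBiclust is defined in earlier sections and may differ. The names `weighted_bcss` and `best_split`, the toy data, and the feature weights are hypothetical, and the exhaustive search stands in for a practical optimizer only because the example is tiny.

```python
from itertools import product


def weighted_bcss(X, in_bicluster, weights):
    """Weighted between-cluster sum of squares for a two-group split.

    X            : list of observations, each a list of p feature values
    in_bicluster : list of booleans, True for observations in the bicluster
    weights      : non-negative feature weights, one per feature
    """
    n = len(X)
    n1 = sum(in_bicluster)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return 0.0
    total = 0.0
    for j, w in enumerate(weights):
        mean1 = sum(x[j] for x, b in zip(X, in_bicluster) if b) / n1
        mean0 = sum(x[j] for x, b in zip(X, in_bicluster) if not b) / n0
        # For two groups, BCSS_j = (n1 * n0 / n) * (mean1 - mean0)^2,
        # which grows as the bicluster moves away from the remaining data.
        total += w * (n1 * n0 / n) * (mean1 - mean0) ** 2
    return total


def best_split(X, weights):
    """Exhaustive search over all splits (exponential cost; toy sizes only)."""
    best_score, best_labels = 0.0, None
    for labels in product([False, True], repeat=len(X)):
        score = weighted_bcss(X, list(labels), weights)
        if score > best_score:
            best_score, best_labels = score, list(labels)
    return best_score, best_labels


# Hypothetical toy data: feature 0 separates the last two observations,
# feature 1 is noise and receives weight 0.
X = [[0.0, 5.0], [0.1, 1.0], [3.0, 4.0], [3.2, 2.0]]
score, labels = best_split(X, weights=[1.0, 0.0])
print(labels)  # [False, False, True, True]
```

Any other function with the same monotonic behavior, together with a suitable search over observations, could be substituted for these two pieces.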
We have proposed two ways of generating null weights for SCBiclust. SCBiclust_β
makes use of distributional assumptions about the feature weights, assuming that the
between-cluster sums of squares for the features are uncorrelated. The other method, SCBiclust_U,
generates unimodal null data and determines feature weights from clustering this null
data. We find that both methods generally produce comparable results. SCBiclust_U may
be more suited to finding biclusters in non-normal settings, but may be more restrictive
about feature selection. Using the weights produced by SCBiclust_β greatly reduces the
computation time. If possible, we recommend using both methods and comparing the results
to identify biclusters in a data set.
Finally, SCBiclust can be modified to incorporate any cluster significance testing
method as a stopping criterion for bicluster identification. Evaluating the number
of clusters, or biclusters, present in the data is an ongoing field of research, so having a
method that can accommodate advances in that research is important. Currently we iteratively
employ the SigClust algorithm (Liu et al., 2008) to test the significance of each putative
bicluster. We chose this method for the present paper because of its accuracy in many
situations and its relatively short computation time. In our simulation studies we found
that the stopping criterion used by SCBiclust was generally more accurate than those
used by the other biclustering methods, but it may identify slightly more biclusters than are
present in the data. A future area of research includes modifying the criteria for generating
feature weights to also include a non-parametric test for cluster significance.
In this paper we have shown that SCBiclust performs well in terms of bicluster
identification and reproducibility in both simulated and real data. SCBiclust is able to
identify both biclusters that differ from the rest of the data in terms of feature means and
other complex structures that can be identified through hierarchical clustering. Future work
includes extending the method to identify biclusters that differ based on feature variance,
perhaps by extending the method to SVD-based approaches. An additional avenue of
future research is identifying network-based biclusters such that observations in the
biclusters are more correlated than observations outside of the biclusters. These future
advances could further extend the application of SCBiclust in identifying subgroups and
distinguishing features in more complicated HDLSS data.