CHAPTER 4. Novel Data Clustering Method Fuzzy-RW
4.5 Local PCA Induced Automatic Adaptive Clustering
The random-walk-induced distance is crucial to the success of Fuzzy-RW because it can represent the intrinsic geometric structure of the data. The “bandwidth” parameter σ used in defining the weight matrix
$$W_{ij} = \exp\left(-\frac{d^2(x_i, x_j)}{\kappa^{\gamma}(x_i, x_j)\,\sigma}\right)$$
serves as a controlling factor of the random walk: when $d^2(x_i, x_j)$ is smaller than σ, the probability that the random walk travels between $x_i$ and $x_j$ is higher, and vice versa. The induced random-walk-type distances are strongly affected by this bandwidth parameter, and so is the related clustering algorithm. The optimal choice of σ for a given dataset is an active area of research (Coifman and Lafon, 2006). In this section we describe a method for automatically refining the bandwidth parameter according to the local structure of the dataset. This modification makes the related clustering algorithms generate clusters with tighter connections between neighboring data points.
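To make the role of σ concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes γ = 0 (so the density term drops out), squared Euclidean distance for $d^2$, and a row-normalized transition matrix without self-loops; the function names `gaussian_weights` and `transition_matrix` are hypothetical.

```python
import numpy as np

def gaussian_weights(X, sigma):
    """W_ij = exp(-d^2(x_i, x_j) / sigma), the weight matrix with gamma = 0 (assumed)."""
    # Pairwise squared Euclidean distances via broadcasting.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    return np.exp(-d2 / sigma)

def transition_matrix(W):
    """Row-normalize W into transition probabilities (assumed form, no self-loops)."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

# Three points on a line: x_0 and x_1 are close, x_2 is far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0]])
for sigma in (0.01, 1.0):
    P = transition_matrix(gaussian_weights(X, sigma))
    print(sigma, P[0])
```

With σ = 0.01 almost all of the probability leaving $x_0$ goes to its close neighbor $x_1$, while with σ = 1 the walk reaches the distant point $x_2$ with probability roughly 0.27. A single global σ therefore decides how local the walk is everywhere at once.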
To use the local neighborhood information around each data point $x_i \in \mathbb{R}^p$, we denote the $s$-step neighbors of radius $r$ around $x_i$ by $N_s(x_i, r)$, as defined in (4.19). Performing PCA on the centered version of the set
$$N(x_i, r) = N_1(x_i, r) \cup N_2(x_i, r) \cup \cdots \cup N_s(x_i, r)$$
yields the principal components and the corresponding eigenvalues, denoted by $\{v_k^i\}_{k=1}^p$ and $\{\lambda_k^i\}_{k=1}^p$, where $\lambda_1^i \ge \lambda_2^i \ge \cdots \ge \lambda_p^i$.
The principal components indicate the directions along which the variance of $N(x_i, r)$ is maximized, and the corresponding eigenvalues measure the variance along each of these directions. The first few principal components $\{v_k^i\}_{k=1}^m$ ($m \le p$) can therefore serve as a basis, with each $v_k^i$ giving the direction of a coordinate axis. The origin of this coordinate system is the center of $N(x_i, r)$, denoted $x_i^*$. To generate clusters in which neighboring data points are “tied” closely together, we would like to shorten the distances along the first few principal directions $\{v_k^i\}_{k=1}^m$ by the factors $\{\lambda_k^i\}_{k=1}^m$, respectively. This can be done by first centering all data points so that $x_i^*$ is the origin of the new coordinate system, then shortening the distances along each $v_k^i$ by the factor $\lambda_k^i$, $k = 1, 2, \dots, m$. Without loss of generality, we can use the normalized values
$$\frac{\lambda_k^i}{\sum_{l=1}^{m} \lambda_l^i}$$
as the shortening factors. The method of shortening is described in the previous section; here we illustrate the process with the simple case of $\mathbb{R}^2$ with $m = p = 2$ (the cases of $\mathbb{R}^n$ and/or $m < p$ generalize from here).
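The local PCA step and the normalized shortening factors can be sketched as follows. This is an illustration under stated assumptions, not the procedure of (4.19) itself: here a point is taken to be a $k$-step neighbor if it lies within Euclidean distance $r$ of some $(k-1)$-step neighbor, which may differ from the definition used in this chapter. The helper names `neighborhood_union`, `local_pca`, and `shortening_factors` are hypothetical.

```python
import numpy as np

def neighborhood_union(X, i, r, s):
    """Indices of N_1 union ... union N_s around x_i.

    ASSUMPTION: a k-step neighbor lies within distance r of some
    (k-1)-step neighbor, with x_i itself as step 0.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    reached = np.zeros(len(X), dtype=bool)
    reached[i] = True
    for _ in range(s):
        # Grow the set by every point within r of an already reached point.
        reached = (d[reached] <= r).any(axis=0)
    return np.flatnonzero(reached)

def local_pca(X, idx):
    """Center, eigenvalues (descending) and principal directions (columns)."""
    pts = X[idx]                        # needs at least two points
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)     # np.cov centers the data itself
    evals, evecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return center, evals[order], evecs[:, order]

def shortening_factors(evals, m):
    """Normalized factors lambda_k / sum_l lambda_l over the first m directions."""
    lam = evals[:m]
    return lam / lam.sum()

# Points along a line segment with a small deterministic wiggle.
t = np.linspace(0.0, 1.0, 50)
X = np.column_stack([t, 0.01 * np.sin(37.0 * t)])
idx = neighborhood_union(X, i=25, r=0.05, s=2)
center, evals, evecs = local_pca(X, idx)
print(idx)
print(shortening_factors(evals, m=2))
```

For this nearly one-dimensional neighborhood the first factor is close to 1 and the first principal direction is close to the direction of the segment, which is the situation the adjustment is designed to exploit.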
For a fixed data point $x_i \in \mathbb{R}^2$ in the given dataset $X = \{x_k\}_{k=1}^N$, a given number of steps $s$, and a radius $r$, let $N_s(x_i, r)$ be the set of $s$-step neighbors of $x_i$. Let $\{v_1^i, v_2^i\}$ be the two orthonormal principal components of the set $D = \{y - x_i^* \mid y \in N(x_i, r)\}$, with associated eigenvalues $\lambda_1^i, \lambda_2^i$, where $x_i^*$ is the center of $N(x_i, r)$ and $N(x_i, r) = N_1(x_i, r) \cup N_2(x_i, r) \cup \cdots \cup N_s(x_i, r)$. For simplicity, we write $v_1 = v_1^i$, $v_2 = v_2^i$, $\lambda_1 = \lambda_1^i$ and $\lambda_2 = \lambda_2^i$. To set up the new coordinate system, we shift the original dataset $X$ to $\tilde{X} = \{\tilde{x}_k \mid \tilde{x}_k = x_k - x_i^*,\ k = 1, 2, \dots, N\}$. Let $\hat{\lambda}_k = \lambda_k / (\lambda_1 + \lambda_2)$, $k = 1, 2$. Using the notation introduced in the previous section, we define
$$V = [\hat{\lambda}_1 v_1 \;\; \hat{\lambda}_2 v_2] \tag{4.22}$$
and $C = V V^T$. The Mahalanobis distance $\hat{d}$ between $x_i$ (in $X$) and any data point $x_j$ (in $X$) is then defined through
$$\hat{d}^2(x_i, x_j) = (\tilde{x}_i - \tilde{x}_j)^T C^{-1} (\tilde{x}_i - \tilde{x}_j).$$
This modified distance can be used in defining the weight matrix
$$W_{ij} = \exp\left(-\frac{\hat{d}^2(x_i, x_j)}{\kappa^{\gamma}(x_i, x_j)\,\sigma}\right),$$
where γ is a parameter that controls the effect of the density term in the weight matrix. The new weight matrix in turn induces the random-walk-type distances, which can be used in a standard clustering algorithm. We still have to choose the “bandwidth” parameter σ, but its effect is now modulated by the shortening process encoded in $\hat{d}$. Thus the “bandwidth” no longer affects all data points uniformly; instead, it is adjusted automatically through the above process.
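The modified distance can be evaluated directly from the principal directions and the normalized factors. The sketch below is a minimal illustration for the $\mathbb{R}^2$ case, assuming the principal directions are stored as the columns of a matrix `v`; the function name `local_mahalanobis_sq` is hypothetical. Since both points are shifted by the same center $x_i^*$, the shift cancels in the difference $\tilde{x}_i - \tilde{x}_j$.

```python
import numpy as np

def local_mahalanobis_sq(x_i, x_j, v, lam_hat):
    """Squared distance (x_i - x_j)^T C^{-1} (x_i - x_j) with C = V V^T.

    v       : (2, 2) array whose columns are the orthonormal directions v_1, v_2
    lam_hat : (2,) array of normalized factors (lam_hat_1, lam_hat_2)
    """
    V = v * lam_hat   # broadcasting scales column k of v by lam_hat[k]
    C = V @ V.T
    diff = x_i - x_j  # the common shift by x_i^* cancels here
    # Solve C z = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(C, diff))

# Illustrative values: axis-aligned directions with eigenvalues 4 and 1.
v = np.eye(2)
lam = np.array([4.0, 1.0])
lam_hat = lam / lam.sum()

origin = np.zeros(2)
along_v1 = local_mahalanobis_sq(origin, np.array([1.0, 0.0]), v, lam_hat)
along_v2 = local_mahalanobis_sq(origin, np.array([0.0, 1.0]), v, lam_hat)
print(along_v1, along_v2)
```

In this example a unit displacement along $v_1$ has squared distance about 1.56, while the same displacement along $v_2$ has squared distance about 25, so points spread along the dominant local direction are treated as much closer to each other than points separated across it.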
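Regularization can be introduced into the above shortening procedure to avoid extreme cases. We again use a dataset $X \subset \mathbb{R}^2$ as an example. To avoid cases where $\hat{\lambda}_1$ is too large compared with $\hat{\lambda}_2$, which makes the shortening procedure too restricted to the first principal component $v_1$, we can introduce a controlling constant $c$ when we define $V$ in (4.22):
$$\hat{\lambda}_1 = \begin{cases} \dfrac{\lambda_1}{\lambda_1 + \lambda_2} & \text{if } \dfrac{\lambda_1}{\lambda_2} \le c, \\[1.5ex] \dfrac{c}{c + 1} & \text{if } \dfrac{\lambda_1}{\lambda_2} > c, \end{cases} \tag{4.23}$$
and $\hat{\lambda}_2$ can be computed as $\hat{\lambda}_2 = 1 - \hat{\lambda}_1$ in $\mathbb{R}^2$.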
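The capped factor in (4.23) can be written as a short function. This is a sketch for the $\mathbb{R}^2$ case only; the function name `regularized_factors` is hypothetical, and treating $\lambda_2 = 0$ as an infinite ratio (so that the cap applies) is an assumption not stated in the text.

```python
def regularized_factors(lam1, lam2, c):
    """Return (lam1_hat, lam2_hat) following (4.23), with lam1 >= lam2 >= 0 and c > 0."""
    # ASSUMPTION: lam2 == 0 is treated as an infinite ratio, so the cap applies.
    if lam2 > 0 and lam1 / lam2 <= c:
        lam1_hat = lam1 / (lam1 + lam2)
    else:
        lam1_hat = c / (c + 1.0)
    return lam1_hat, 1.0 - lam1_hat

# Moderate anisotropy: the ratio 4 is below c = 9, so no capping occurs.
print(regularized_factors(4.0, 1.0, c=9.0))
# Extreme anisotropy: the ratio 99 exceeds c = 9, so lam1_hat is capped at c / (c + 1).
print(regularized_factors(99.0, 1.0, c=9.0))
```

The two branches agree when $\lambda_1 / \lambda_2 = c$, since then $\lambda_1 / (\lambda_1 + \lambda_2) = c / (c + 1)$, so the regularized factor varies continuously with the eigenvalue ratio.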
We illustrate the method on the dataset shown in Fig. 4.6. Fig. 4.6 (a) shows the clustering result obtained by Fuzzy-RW with the commute distance, and (b) shows the maximum membership value at each data point. The related parameters are σ = 0.017, γ = 0, K = 1025, and the threshold is set to 0.7. Fig. 4.7 (a) shows the clustering result obtained by Fuzzy-RW with the commute distance combined with the local PCA and automatic adjustment described in this section, and Fig. 4.7 (b) shows the maximum membership value at each data point. The related parameters are s = 2, r = 0.06, a = 1.5, σ = 0.017, γ = 0, K = 1036, and the threshold is set to 0.95.
It is clear from the figures that the local PCA adjustment makes the clustering algorithm generate clusters with tight local connections. For datasets whose clusters consist of data points scattered along line segments, this type of adjustment is especially useful for detecting the segments.
Figure 4.5: (a) A dataset perturbed by noise. This dataset is used to demonstrate the technique of clustering with directional preference. (b) The clustering result obtained by specifying a directional preference and setting the threshold to 0.75 (see text for details). (c) The maximum membership values at each data point.
Figure 4.6: (a) Clustering result obtained by using Fuzzy-RW. The threshold is set to 0.7. (b) Maximum membership values at each data point. (See text for the parameters involved.)
Figure 4.7: (a) Clustering result obtained by using Fuzzy-RW combined with local PCA. The threshold is set to 0.95. (b) Maximum membership values at each data point. (See text for the parameters involved.)