Synthetic Data: Effectiveness - Efficient algorithms in analyzing genomic data

4.6 Experiments

4.6.1 Synthetic Data: Effectiveness

We generated three synthetic datasets for this set of experiments.

• Syndata1: A 400×6 data matrix. Two nonlinear helix correlation clusters are embedded to resemble the synthetic data used in CURLER (Tung et al. (2005)). The clusters and nonlinear dependencies are listed in Table 4.2.

• Syndata2: A 100×100 data matrix. Two linear correlation clusters are embedded to resemble the synthetic data used in CARE (Zhang et al. (2008)). The clusters and linear dependencies are listed in Table 4.2.

• Syndata3: A 300×7 data matrix. This dataset has been used as the example in Figure 4.1. Two linear correlation clusters and one nonlinear correlation cluster are embedded. The three clusters overlap with each other on both features and objects. The clusters and linear/nonlinear dependencies are listed in Table 4.2.
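As a rough illustration of how such datasets can be constructed, the following sketch builds matrices with the shapes of Syndata1 and Syndata2 and embeds the dependencies listed in Table 4.2. The function names, the uniform background noise, the evenly spaced helix parameters and the absence of noise on the dependent features are all assumptions made for this example; the text does not specify them.

```python
import numpy as np

def make_syndata1(seed=0):
    """400x6 matrix with two helix clusters, following Table 4.2."""
    rng = np.random.default_rng(seed)
    # Assumption: features outside a cluster are uniform noise in [0, 1).
    X = rng.uniform(0.0, 1.0, size=(400, 6))
    # Assumption: the helix parameters t and u are evenly spaced on [0, 6*pi].
    t = np.linspace(0.0, 6 * np.pi, 200)
    u = np.linspace(0.0, 6 * np.pi, 200)
    # Cluster 1: objects s1..s200 on features f1..f3.
    X[:200, 0] = 2 * t
    X[:200, 1] = 1.2 * np.sin(t)
    X[:200, 2] = 1.2 * np.cos(t)
    # Cluster 2: objects s201..s400 on features f4..f6.
    X[200:, 3] = u
    X[200:, 4] = 2 * np.sin(u)
    X[200:, 5] = 2 * np.cos(u)
    return X

def make_syndata2(seed=0):
    """100x100 matrix with two linear clusters, following Table 4.2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(100, 100))
    # Cluster 1: objects s1..s60, f10 = f8 + f9 (0-based columns 9, 7, 8).
    X[:60, 9] = X[:60, 7] + X[:60, 8]
    # Cluster 2: objects s41..s100, f50 = f48 + f49.
    X[40:, 49] = X[40:, 47] + X[40:, 48]
    return X

syndata1 = make_syndata1()
syndata2 = make_syndata2()
print(syndata1.shape, syndata2.shape)
```

Objects s₄₁ to s₆₀ belong to both clusters of Syndata2, which is the object overlap referred to in the discussion of that dataset.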

Table 4.2: Correlation Clusters Embedded in Syndata1, 2 and 3

Dataset    Cluster   Point Subset          Feature Dependency
Syndata1   1         {s₁, ..., s₂₀₀}       f₁ = 2·t, f₂ = 1.2·sin(t), f₃ = 1.2·cos(t), t ∈ [0, 6π]
Syndata1   2         {s₂₀₁, ..., s₄₀₀}     f₄ = u, f₅ = 2·sin(u), f₆ = 2·cos(u), u ∈ [0, 6π]
Syndata2   1         {s₁, ..., s₆₀}        f₁₀ = f₈ + f₉
Syndata2   2         {s₄₁, ..., s₁₀₀}      f₅₀ = f₄₈ + f₄₉
Syndata3   1         {s₁, ..., s₁₀₀}       f₂ = f₁ + 0.5·f₃
Syndata3   2         {s₆₁, ..., s₁₆₀}      f₇ = f₅ + f₆
Syndata3   3         {s₁₂₀, ..., s₂₁₉}     f₄ = 2·f₃·f₅

Syndata1

Since both embedded correlations are nonlinear, CARE found no clusters on Syndata1. The output of CURLER is a visualization of the clusters called an NNCO plot (Tung et al. (2005)). The NNCO plot on Syndata1 is shown in Figure 4.8(a). The x-axis denotes the micro-clusters, which are ordered according to the cluster merging procedure of CURLER. The y-axis denotes the co-sharing level of the micro-clusters. In general, a cluster is represented by a hill shape in the NNCO plot. The bars below the co-sharing plot represent the orientations of the micro-clusters. Micro-clusters in the same cluster have similar orientations, so a block (or a pattern) can be observed in the corresponding bars. Interested readers may refer to Tung et al. (2005) for details. Note that in the caption of Figure 4.8, the term 'CURLER₄₀₀' denotes that CURLER generated 400 micro-clusters on this dataset.

We can observe two hills in Figure 4.8(a), one from micro-clusters 1 to 200 and the other from micro-clusters 201 to 400. These two hills clearly indicate the existence of the two embedded nonlinear (helix) clusters.

Figure 4.8: Outputs of CURLER₄₀₀ and TreeNL on Syndata1

For TreeNL, we set gmin = 10 and fsetmax = 3. Instead of outputting only the top-K clusters (feature subsets), we output the correlation score of every enumerated cluster (feature subset) for a fair comparison.

Figure 4.8(b) plots the output of TreeNL on Syndata1, that is, the −C(F) score of each enumerated feature subset. The x-axis denotes the feature subsets in lexicographic order. The y-axis denotes the −C(F) score. Note that a higher point in the figure indicates a stronger correlation.

We can observe two peak points in Figure 4.8(b). The left peak represents feature subset {f₁, f₂, f₃}, which corresponds to cluster 1, and the right peak represents feature subset {f₄, f₅, f₆}, which corresponds to cluster 2. For each peak, there are two other points to the left which also indicate strong correlation. These points represent feature subsets {f₁, f₂}, {f₁, f₃}, {f₄, f₅} and {f₄, f₆} respectively. According to the dependency functions in Table 4.2, the correlations between these features are obvious.

On Syndata1, both CURLER and TreeNL found the embedded correlations while CARE failed.
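One way to see why a purely linear criterion finds nothing on Syndata1 is to look at the smallest eigenvalue of the correlation matrix of the features involved. A linear dependency drives that eigenvalue to zero, whereas the helix of cluster 1 leaves it well away from zero. The sketch below is only an illustration of this effect: it is not CARE's algorithm, and the helper name `smallest_corr_eigenvalue` and the uniform comparison data are assumptions made for the example.

```python
import numpy as np

def smallest_corr_eigenvalue(X):
    """Smallest eigenvalue of the correlation matrix of X (columns = features)."""
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[0]

rng = np.random.default_rng(0)

# Helix of cluster 1 in Syndata1 (Table 4.2), with t evenly spaced (assumption).
t = np.linspace(0.0, 6 * np.pi, 200)
helix = np.column_stack([2 * t, 1.2 * np.sin(t), 1.2 * np.cos(t)])

# A linear dependency for comparison: third feature = sum of the first two.
a = rng.uniform(size=200)
b = rng.uniform(size=200)
linear = np.column_stack([a, b, a + b])

helix_value = smallest_corr_eigenvalue(helix)
linear_value = smallest_corr_eigenvalue(linear)
print(f"helix:  {helix_value:.3f}")
print(f"linear: {linear_value:.3e}")
```

The helix features are fully determined by t, yet no single linear combination of them is close to constant, so a method that thresholds small eigenvalues has nothing to report.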

Syndata2

CARE successfully detected the two embedded linear correlations and provided quantitative information about the dependencies:

• f₈ + 1.02·f₉ − 0.98·f₁₀ = 0

• f₄₈ + 0.99·f₄₉ − 0.97·f₅₀ = 0
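Coefficients of this kind can be read off the direction of least variance of the correlated features. The sketch below illustrates the idea on data shaped like cluster 1 of Syndata2; it is a generic minor-component computation, not CARE's implementation, and the noise level and random seed are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60  # number of objects in cluster 1 of Syndata2

f8 = rng.uniform(size=n)
f9 = rng.uniform(size=n)
# Assumption: small Gaussian noise on the dependent feature.
f10 = f8 + f9 + rng.normal(scale=0.01, size=n)

X = np.column_stack([f8, f9, f10])
# eigh returns eigenvalues in ascending order, so column 0 of the
# eigenvector matrix is the direction along which the data vary least.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
w = eigvecs[:, 0]
coef = w / w[0]  # rescale so that the coefficient of f8 is 1
print("coefficients of (f8, f9, f10):", np.round(coef, 2))
```

After rescaling, the coefficients should be close to (1, 1, −1), the same form as the dependency f₈ + f₉ − f₁₀ = 0 from Table 4.2.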

Figure 4.9: Outputs of CURLER₁₀₀ and TreeNL on Syndata2: (a) CURLER, (b) TreeNL

The NNCO plot of CURLER is shown in Figure 4.9(a). There are no obvious hills that would indicate the two embedded clusters. Both the high intrinsic dimensionality of the data (94% of the features are random noise) and the overlap of data objects between the two clusters prevent CURLER from finding the clusters.

For TreeNL, we still use gmin = 10 and fsetmax = 3. Figure 4.9(b) plots the output of TreeNL on Syndata2. The two top points in Figure 4.9(b) represent feature subsets {f₈, f₉, f₁₀} and {f₄₈, f₄₉, f₅₀}, which correspond to the two embedded clusters.

On Syndata2, both CARE and TreeNL found the embedded correlations while CURLER failed.

Syndata3

CARE did not find the nonlinear correlation cluster. Since all three correlated feature subsets are supported by only a minority of the data objects (30%), CARE found only one linear correlation:

• f₁ − 0.97·f₂ + 0.51·f₃ = 0

If we relax the parameters of CARE, e.g., lowering the minimum support threshold, the second correlation will then be found together with many other weakly correlated and spurious clusters.
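The effect of limited support can be illustrated with a small experiment: when only 100 of 300 objects follow a linear dependency, a statistic computed over all objects barely registers it, although it is exact on the supporting subset. The sketch below uses the smallest eigenvalue of the correlation matrix as a stand-in statistic; this is not CARE's actual criterion, and the uniform background values and the helper name `weakest_eigenvalue` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Features f1, f2, f3 of a 300-object dataset.
# Assumption: values outside the cluster are uniform noise in [0, 1).
F = rng.uniform(size=(300, 3))
# Cluster 1 of Syndata3 (Table 4.2): s1..s100 satisfy f2 = f1 + 0.5*f3.
F[:100, 1] = F[:100, 0] + 0.5 * F[:100, 2]

def weakest_eigenvalue(X):
    """Smallest eigenvalue of the correlation matrix of X (columns = features)."""
    return np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[0]

on_all_objects = weakest_eigenvalue(F)
on_cluster_only = weakest_eigenvalue(F[:100])
print(f"all 300 objects:  {on_all_objects:.3f}")
print(f"cluster 1 only:   {on_cluster_only:.3e}")
```

On the supporting objects the value is numerically zero, while over the full dataset it stays far from zero, which is why a threshold loose enough to catch such clusters globally also tends to admit weak or spurious ones.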

Figure 4.10: Outputs of CURLER₁₀₀ and TreeNL on Syndata3

The NNCO plot of CURLER is shown in Figure 4.10(a), which contains two small hills. The micro-clusters corresponding to these hills contain parts of the objects in clusters 1 and 3. Because of the substantial overlap between the embedded clusters, CURLER did not find cluster 2 (see Table 4.2) and found only parts of clusters 1 and 3.

For TreeNL, we use the same setting, gmin = 10 and fsetmax = 3. Figure 4.10(b) plots the output of TreeNL on Syndata3. The top points represent feature subsets {f₁, f₂, f₃}, {f₃, f₄, f₅} and {f₅, f₆, f₇}, which correspond to the three embedded clusters. The data objects in each cluster returned by TreeNL are plotted in Figure 4.11. Compared with the embedded clusters shown in Figure 4.1, we can see that TreeNL discovers both linear and nonlinear correlations very accurately.

Figure 4.11: Clusters of objects found by TreeNL on Syndata3

On Syndata3, TreeNL found all three embedded clusters while CARE and CURLER only found some of them.
