Data Set 1: Forest Cover Type - Quality tests Real Data

3.4 Quality tests Real Data

3.4.1 Data Set 1: Forest Cover Type

The Forest Cover Type data set has been used frequently in the stream clustering literature, for example in StreamKM++ [AMR+12] and Clustree [KABS11]. We therefore use it in MOA to compare the algorithms. Table 16 contains a brief description of the data set:

Real Data Set Name: Forest Cover Type (KDD)
Size: 580K observations
Goal: Cluster the 7 different types of forest trees found in the terrain
Description: The forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 10 continuous variables: Elevation, Aspect, Slope, Horizontal Distance To Hydrology, Vertical Distance To Hydrology, Horizontal Distance To Roadways, Hillshade 9am, Hillshade Noon, Hillshade 3pm, Horizontal Distance To Fire Points.
Classes: 7 classes to cluster (forest cover types), in variable Cover Type desc: Spruce/Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, Douglas-fir, Krummholz
Min classes (footnote 79): 2
Max classes (footnote 80): 7
Avg classes (footnote 81): 4.02
Location: http://kdd.ics.uci.edu/databases/covertype/covertype.html

Table 16: Forest Cover Type data set details
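The Min, Max and Avg classes rows of Table 16 count how many different class labels are present in the stream at the same time. The source does not state how these figures were computed, so the following is only a minimal sketch under the assumption that the stream is cut into non-overlapping windows of horizon instances; the function name `class_count_stats` and the toy labels are hypothetical.

```python
def class_count_stats(labels, horizon=1000):
    """Return (min, max, mean) number of distinct class labels per window.

    Assumption: windows are non-overlapping chunks of `horizon` instances;
    the last window may be shorter.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = []
    for start in range(0, len(labels), horizon):
        window = labels[start:start + horizon]
        counts.append(len(set(window)))  # distinct classes in this window
    return min(counts), max(counts), sum(counts) / len(counts)


# Toy stream: three windows of four instances each (hypothetical labels).
toy_labels = [1, 1, 2, 2,  2, 2, 2, 2,  1, 3, 5, 7]
print(class_count_stats(toy_labels, horizon=4))
```

Applied to the Cover Type column of the real stream, a procedure of this kind would yield figures comparable to the ones reported in the table, although the exact windowing used there is not known.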

We select settings for the data stream (default values decayHorizon = 1000, decayThreshold = 0.01, evaluationFrequency = 1000) and normalize the data. We then execute the streaming of the data and notice again that visualization in streaming scenarios is an important field to be improved upon. Thankfully, MOA provides some basic 2-dimensional visualization, which is helpful, but a streaming version of techniques like parallel bars would be very useful for this sort of investigation. We even considered developing such a technique, but realized that it would be far too ambitious for the scope of this work.

We finally select some sets of informative dimensions and see that the classes are intertwined and broken into different pieces, which certainly poses a real challenge to the algorithms. Figure 50 shows four different 2-dimensional screenshots in MOA of the streaming of the data set, all taken at one specific instant. This gives an idea of what the data looks like. The upper left picture displays the attributes Elevation vs Aspect, the upper right Elevation vs Horizontal distance to hydrology, the bottom left Elevation vs Vertical distance to hydrology and the bottom right Elevation vs Horizontal distance to roadways. We also identify four classes (four types of forest cover) present in the stream at that instant (class 1 to class 4). In the right-hand plots, we observe that class 1, class 2 and class 3 each appear in several different places. That already indicates that creating unique non-overlapping clusters for each class will be difficult.
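The normalization step matters because distance thresholds such as D_MAX or epsilon are only meaningful when all attributes share a common scale. The source does not say which normalization is applied, so the sketch below assumes plain min-max scaling to [0, 1] with attribute ranges known in advance; the helper names and the sample values are hypothetical.

```python
def fit_min_max(rows):
    """Per-attribute minimum and maximum over equal-length numeric rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(row, mins, maxs):
    """Scale one instance to [0, 1]; constant attributes map to 0.0."""
    return [
        (x - lo) / (hi - lo) if hi > lo else 0.0
        for x, lo, hi in zip(row, mins, maxs)
    ]


# Hypothetical (Elevation, Aspect, Slope) values, for illustration only.
rows = [[2000.0, 0.0, 10.0], [3000.0, 180.0, 10.0], [2500.0, 360.0, 10.0]]
mins, maxs = fit_min_max(rows)
scaled = [normalize(r, mins, maxs) for r in rows]
print(scaled)
```

In a true streaming setting the ranges are not known beforehand, so they would have to come from domain knowledge or from an initial batch of instances.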

Figure 50: Forest Cover Type: visualizing the stream using four different sets of attributes

We also realize that the calculation of the ground-truth (true clustering) done by MOA is not straightforward, as seen in Figure 51. MOA probably does this on-the-fly by capturing each class (all instances sharing the same label) with the smallest possible hyper-spherical cluster. We should bear in mind that this step is necessary in order to calculate external clustering measures. We observe seven oversized and overlapping clusters (contour in black), one covering each class label (forest cover type, i.e. Lodgepole Pine, Ponderosa Pine, etc.). The quality measures could well be affected by this.
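The effect described above can be reproduced with a small sketch. It assumes that each class is covered by a sphere centred at the class centroid with a radius reaching its farthest member; this is a simplification and not necessarily what MOA does (it is not even the minimal enclosing ball). The function names and the toy points are hypothetical.

```python
import math


def enclosing_spheres(points, labels):
    """Map each class label to (center, radius).

    Assumption: center = class centroid, radius = distance to the
    farthest class member.
    """
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    spheres = {}
    for y, members in groups.items():
        dim = len(members[0])
        center = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        radius = max(math.dist(center, m) for m in members)
        spheres[y] = (center, radius)
    return spheres


def overlap(a, b):
    """True if two (center, radius) spheres intersect."""
    (ca, ra), (cb, rb) = a, b
    return math.dist(ca, cb) < ra + rb


# Class "A" is split into two distant pieces; class "B" lies between them.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.1), (0.5, 0.3)]
labels = ["A", "A", "B", "B"]
spheres = enclosing_spheres(points, labels)
print(spheres["A"], spheres["B"], overlap(spheres["A"], spheres["B"]))
```

A class that is fragmented into distant pieces forces a large sphere that swallows other classes, which is consistent with the oversized, overlapping ground-truth clusters seen in the figure.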

Figure 51: Forest Cover Type: true clusters or ground-truth calculated on-the-fly by MOA

80 Maximum number of different classes appearing simultaneously.

Table 17 contains the parametrization chosen for the algorithms (horizon = 1000 for all):

Algorithm: StreamLeader
Default Param: D_MAX = 0.11
Optimal Param: D_MAX = 0.17

Algorithm: Clustream
Default Param: MaxNumKernels = 100, kernelRadiFactor = 2
Optimal Param: MaxNumKernels = 150 (footnote 82), kernelRadiFactor = 2

Algorithm: Clustree
Default Param: maxHeight = 8
Optimal Param: maxHeight = 14

Algorithm: Denstream (footnote 83)
Default Param: epsilon = 0.02, beta = 0.2, mu = 1, initPoints = 1000, offline = 2, lambda = 0.25, processingSpeed = 100
Optimal Param: epsilon = 0.16, beta = 0.2, mu = 1, initPoints = 1000, offline = 2, lambda = 0.25, processingSpeed = 100

Table 17: Forest Cover Type: parametrization used for StreamLeader, Clustream, Denstream and Clustree.

Figure 52 displays the clustering provided by StreamLeader (contour in red) and Clustream (contour in blue) for the instant shown in Figure 50. The attributes chosen for visualization are Elevation vs Horizontal distance to hydrology. Clustream produces as many clusters (seven) as the ground-truth shows, probably because it receives the true number of clusters from MOA; this help is not to be expected in a real scenario. Even so, its clusters do not seem very consistent with the data. StreamLeader delivers four clusters with the optimal parametrization.
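To give an intuition for why a distance threshold such as D_MAX controls the number of clusters found, the following sketch implements the classic one-pass leader algorithm, not StreamLeader itself. Treating D_MAX as the maximum distance between a point and its leader is an assumption made for illustration, and the toy points are hypothetical.

```python
import math


def leader_clustering(points, d_max):
    """Classic one-pass leader clustering.

    A point joins the nearest existing leader if it lies within d_max;
    otherwise it becomes a new leader.
    """
    leaders = []
    assignments = []
    for p in points:
        best, best_dist = None, float("inf")
        for idx, leader in enumerate(leaders):
            d = math.dist(p, leader)
            if d < best_dist:
                best, best_dist = idx, d
        if best is None or best_dist > d_max:
            leaders.append(p)          # p starts a new cluster
            best = len(leaders) - 1
        assignments.append(best)
    return leaders, assignments


# Normalized 2-dimensional toy points (hypothetical).
points = [(0.10, 0.10), (0.25, 0.10), (0.60, 0.60)]
for d_max in (0.11, 0.17):
    leaders, _ = leader_clustering(points, d_max)
    print(d_max, len(leaders))
```

With the smaller threshold the second point is too far from the first leader and starts its own cluster, whereas the larger threshold merges them; a larger threshold therefore tends to yield fewer and bigger clusters.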

Figure 52: Forest Cover Type: clustering by StreamLeader (red) and Clustream (blue)

82 While 100 micro-clusters already achieve a micro-ratio ≥ 10, we still opt to increase that ratio to achieve finer granularity.

Finally, we gather quality metrics for both default and optimal parametrization in Figures 53 (CMM and Silhouette Coef), 54 (F1-P and F1-R), 55 (Homogeneity and Completeness) and 56 (Rand Stat and Q AVG):

Figure 53: Forest Cover Type: CMM and Silhouette Coef default vs optimal parametrization

Figure 55: Forest Cover Type: Homogeneity and Completeness on default vs optimal parametrization

Figure 57 shows the overall performance with both parametrizations:

Figure 57: Forest Cover Type: overall quality performance

According to the quality metrics, this is a very difficult data set to cluster in streaming. Homogeneity is very low for all algorithms except StreamLeader, which still suffers a quality drop around the middle of the experiment but then recovers. Completeness, on the other hand, shows weak performance for StreamLeader, with values even dropping out of the plot. In general, the classes seem to be fragmented into different pieces in several locations, so the algorithms struggle to find them using hyper-spherical clusters. Another source of uncertainty is the way MOA calculates the ground-truth and compares it with the final clustering to generate external clustering quality measures. All in all, StreamLeader again outperforms Clustream, Denstream and Clustree on average. Table 18 displays the average quality values obtained with default and optimal parametrization. While Denstream crashed in MOA when delivering metrics with the default parametrization, with optimal parameters it outperformed Clustream and Clustree, neither of which improved its results from one parametrization to the other.

Forest Cover Type - FEW BIG CLUSTERS, DIM SPACE d=10

                 SLeader   CluSTR   DenSTR   CTree
Default Param    0.26      0.27     crash    0.27
Optimal Param    0.37      0.27     0.36     0.27
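The opposite behaviour of Homogeneity and Completeness discussed above can be illustrated with the usual conditional-entropy definitions, h = 1 - H(C|K)/H(C) and c = 1 - H(K|C)/H(K), where C are the class labels and K the cluster assignments. Whether MOA implements exactly these formulas is an assumption; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) computed from two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def homogeneity_completeness(classes, clusters):
    """Return (homogeneity, completeness) for parallel label lists."""
    h_c, h_k = _entropy(classes), _entropy(clusters)
    homogeneity = (
        1.0 if h_c == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_c
    )
    completeness = (
        1.0 if h_k == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_k
    )
    return homogeneity, completeness


# One class ("spruce") is split across two pure clusters.
classes = ["spruce", "spruce", "pine", "pine"]
clusters = [0, 1, 2, 2]
h, c = homogeneity_completeness(classes, clusters)
print(h, c)
```

Every cluster in this toy example contains a single class, so homogeneity stays at its maximum, while splitting one class across two clusters lowers completeness. This is the same pattern as a clustering that covers fragmented classes with several small, pure clusters.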

In document Streaming Data Clustering in MOA using the Leader Algorithm (Page 91-97)