A modified smooth incremental clustering algorithm

Now we can modify Algorithm 5 by applying Algorithm 2 in Step 3.

Algorithm 6. A modified smooth incremental algorithm for clustering.

Step 1. (Initialization). Compute the center $x^1 \in \mathbb{R}^n$ of the set A. Set l := 1.

Step 2. (Stopping criterion). Set l := l + 1. If l > k then stop. The k-partition problem has been solved.

Step 3. (Computation of a set of starting points for the next cluster center). Apply Algorithm 2 to compute the set $\bar{A}_5$ defined by (4.1.7).

Step 4. (Computation of a set of cluster centers). For each $\bar{y} \in \bar{A}_5$ take $(x^1, \ldots, x^{l-1}, \bar{y})$ as a starting point and solve Problem (4.5.8). Let $(\hat{y}^1, \ldots, \hat{y}^l)$ be a solution to this problem. Denote by $\bar{A}_6$ the set of all such solutions.

Step 5. (Computation of the best solution). Compute

$$f_l^{\min} = \min \left\{ f_l(\hat{y}^1, \ldots, \hat{y}^l) : (\hat{y}^1, \ldots, \hat{y}^l) \in \bar{A}_6 \right\}$$

and the collection of cluster centers $(\bar{y}^1, \ldots, \bar{y}^l)$ such that

$$f_l(\bar{y}^1, \ldots, \bar{y}^l) = f_l^{\min}.$$

Step 6. (Solution to the l-partition problem). Set $x^j := \bar{y}^j$, j = 1, . . . , l, as a solution to the l-partition problem and go to Step 2.
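The incremental structure of Algorithm 6 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in this thesis: Algorithm 2 and Problem (4.5.8) are not reproduced in this section, so the hypothetical helper `candidate_starts` simply takes every data point as a candidate starting point, and the hypothetical helper `local_solve` runs plain k-means (Lloyd) iterations on the average squared-distance cluster function instead of the smooth optimization solver. With these substitutions the sketch behaves like a global k-means scheme.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cluster_function(centers, data):
    """Average squared distance from each point to its nearest center."""
    return sum(min(sqdist(a, c) for c in centers) for a in data) / len(data)

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def local_solve(centers, data, max_iter=100):
    """ASSUMPTION: stand-in for Problem (4.5.8), plain k-means iterations."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for a in data:
            j = min(range(len(centers)), key=lambda j: sqdist(a, centers[j]))
            groups[j].append(a)
        # An empty cluster keeps its previous center.
        new = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers

def candidate_starts(centers, data):
    """ASSUMPTION: stand-in for Algorithm 2, every point that is not a center."""
    return [a for a in data if a not in centers]

def incremental_clustering(data, k):
    """Return {l: centers} for l = 1, ..., k, following Steps 1-6."""
    data = [tuple(a) for a in data]
    centers = [mean(data)]                                   # Step 1
    solutions = {1: centers}
    for l in range(2, k + 1):                                # Step 2
        starts = candidate_starts(centers, data)             # Step 3
        trials = [local_solve(centers + [y], data) for y in starts]     # Step 4
        centers = min(trials, key=lambda c: cluster_function(c, data))  # Step 5
        solutions[l] = centers                               # Step 6
    return solutions
```

Calling `incremental_clustering(data, k)` returns the centers found for every l up to k, which mirrors the way the algorithm solves all intermediate l-partition problems on the way to the k-partition problem. The sketch assumes the data contain more than k distinct points.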

Chapter 5 Computational results: small data sets

In this chapter we present and discuss computational results on small data sets. All data sets contain only numeric attributes and have no missing values. First, we give a brief description of the data sets, then present the results. These results include the optimal values of the cluster function obtained by each algorithm and the CPU time each algorithm requires. The following algorithms are used for comparison: the global k-means algorithm (GKM), the multi-start modified global k-means algorithm (MS-MGKM), the multi-start k-means algorithm (MS-KM), the difference of convex clustering algorithm (DCA), the clustering algorithm based on the difference of convex representation of the cluster function and nonsmooth optimization (DCClust) and two algorithms proposed in this thesis: the fast multi-start modified global k-means algorithm without weights (FMS-MGKM2) and with weights (FMS-MGKM). The description of these algorithms can be found in Chapter 4.

The number of starting points in the MS-KM algorithm is set to 500. Algorithms MS-MGKM, DCA and DCClust use the algorithm for computation of starting cluster centers described in the previous chapter. The implementation of the FMS-MGKM and FMS-MGKM2 algorithms was also discussed in the previous chapter. CPU times in all tables are in seconds. In all data sets, up to 25 clusters are computed.
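The multi-start idea behind MS-KM can be sketched as follows. This is an illustrative sketch only, not the code used in the experiments: the function names, the seed parameter, and the choice of random data points as initial centers are assumptions made for the example, and the objective is taken to be the average squared distance to the nearest center.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cluster_function(centers, data):
    """Average squared distance from each point to its nearest center."""
    return sum(min(sqdist(a, c) for c in centers) for a in data) / len(data)

def kmeans(centers, data, max_iter=100):
    """Plain k-means (Lloyd) iterations from the given starting centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for a in data:
            j = min(range(len(centers)), key=lambda j: sqdist(a, centers[j]))
            groups[j].append(a)
        # An empty cluster keeps its previous center.
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
               for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers

def multi_start_kmeans(data, k, n_starts=500, seed=0):
    """Run k-means from n_starts random starts and keep the best result.

    ASSUMPTION: each start uses k distinct data points chosen at random.
    """
    rng = random.Random(seed)
    best, best_value = None, float("inf")
    for _ in range(n_starts):
        centers = kmeans(rng.sample(data, k), data)
        value = cluster_function(centers, data)
        if value < best_value:
            best, best_value = centers, value
    return best, best_value
```

A single k-means run can stop at a poor local minimizer, which is why the best value over many starts is reported; the default of 500 starts matches the setting stated above.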

5.1 Data sets

A brief description of the small data sets is given in Table 5.1. In this table we include the number of instances (m), the number of attributes (n) and the total number of entries (Ne) for each data set. The number of instances in these data sets is less than ten thousand and the number of attributes ranges from 4 to 41.

Table 5.1: Small size data sets

No.  Data set                      m      n    Ne
1    Wilt                          4839   5    24195
2    Wine Quality                  4898   12   58776
3    Waveform Generator            5000   41   205000
4    Turkiye Student Evaluation    5820   28   162960
5    Drug data set (yprop-41)      8885   22   195470
6    Combined Cycle Power Plant    9568   4    38272
7    Gesture Phase Segmentation    9900   18   178200
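The total number of entries in Table 5.1 is the product of the number of instances and the number of attributes, Ne = m × n. The short check below, written as an illustrative sketch with the table values copied by hand, confirms this for every row.

```python
# Values copied from Table 5.1: name -> (m, n, Ne).
DATASETS = {
    "Wilt": (4839, 5, 24195),
    "Wine Quality": (4898, 12, 58776),
    "Waveform Generator": (5000, 41, 205000),
    "Turkiye Student Evaluation": (5820, 28, 162960),
    "Drug data set (yprop-41)": (8885, 22, 195470),
    "Combined Cycle Power Plant": (9568, 4, 38272),
    "Gesture Phase Segmentation": (9900, 18, 178200),
}

def total_entries(m, n):
    """Total number of entries of a data set with m instances and n attributes."""
    return m * n

# Names of rows whose listed Ne differs from m * n (empty if the table is consistent).
mismatches = [name for name, (m, n, ne) in DATASETS.items()
              if total_entries(m, n) != ne]
```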

• Wilt is a data set which contains information on Root-wilt-disease (RWD), caused by phytoplasma, in pine trees in Jiangsu Province, collected by remote sensing. The remote sensing study uses a multiscale object-based classification method for detecting diseased trees in high-resolution multispectral satellite imagery. This data set has six attributes but only five attributes contribute to clustering, as one attribute is categorical [13, 83].

• The Wine Quality data set contains the information of two data sets, related to red and white vinho verde wine samples from the north of Portugal. The aim is to predict wine quality based on physicochemical tests [14, 83]. The eleven attributes which take part in the physicochemical tests are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol.

• The Waveform Database Generator data set contains information on three classes of noisy waves, each of which is generated from a combination of 2 of 3 “base” waves. Each instance is generated with added noise (mean 0 and variance 1) in each attribute. This data set has 41 attributes but only 40 attributes take part in clustering experiments, as one attribute is categorical [15, 83].

• The Turkiye Student Evaluation data set consists of evaluation scores made available by students of Gazi University in Ankara (Turkey). There are 33 attributes in total: 28 are course-specific questions and the additional 5 attributes are, namely, instructor ID, classroom number, number of repeats, attendance and number of difficulties. These additional attributes are removed before clustering as they are categorical [16, 83].

• The drug data set “yes on Proposition 41”, abbreviated as yprop-41, is a cancer drug data set which shows a striking difference in the behavior of cancer-drug targets as compared with targets of non-cancer drugs. It has 267 attributes; all binary and categorical values are removed before clustering [6].

• The Combined Cycle Power Plant data set contains data points gathered from a combined cycle power plant over 6 years (2006-2011), when the plant was set to work with full load. It has five attributes: the hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V), which are used to predict the net hourly electrical energy output (EP) of the plant. The attribute related to the output is removed before the clustering experiments, which use only the plant input variables [83].

• Gesture Phase Segmentation is a data set made up of features extracted from 7 videos of people suffering from gesticulating problems; the focus of the study is on gesture phase segmentation. This data set contains 50 attributes divided between two kinds of files: 18 in the raw files and 32 in the processed files. We used only the attributes with numeric values from the raw files. The input data is extracted from 18 XML files which have eighteen features [21, 83].
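Several of the descriptions above mention that categorical or output attributes are removed before clustering. A minimal sketch of such a preprocessing step is given below. It is not the procedure used in this thesis; the function names and the `drop` parameter are hypothetical, and the sketch assumes the data arrive as CSV text with a header row and at least one data row.

```python
import csv
import io

def is_number(text):
    """True if the string parses as a float (note: 'nan' and 'inf' also parse)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def numeric_columns(csv_text, drop=()):
    """Keep only columns whose values are all numeric.

    `drop` is a hypothetical parameter for columns that must be removed by name,
    for example a numerically coded categorical attribute or an output attribute.
    Returns (column names, rows of floats).
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = [c for c in rows[0]
             if c not in drop and all(is_number(r[c]) for r in rows)]
    data = [[float(r[c]) for c in names] for r in rows]
    return names, data
```

For example, with a hypothetical file whose header is `label,a,b`, where `label` holds text values, `numeric_columns(text)` keeps only `a` and `b`, and `numeric_columns(text, drop=("b",))` additionally removes `b`. Attributes that are categorical but stored as numbers cannot be detected automatically and have to be listed in `drop`.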
